Compare commits


3 Commits

Author SHA1 Message Date
Gwendal Roulleau
fe8f06aa09
Merge 5487ef17bc into adacdebb9f 2025-01-08 23:13:42 +01:00
Gwendal Roulleau
5487ef17bc [whisper] Add OpenAI API compatibility
Apply PR comments

Signed-off-by: Gwendal Roulleau <gwendal.roulleau@gmail.com>
2024-12-30 11:47:42 +01:00
Gwendal Roulleau
e40473594a [whisper] Add OpenAI API compatibility
Signed-off-by: Gwendal Roulleau <gwendal.roulleau@gmail.com>
2024-12-17 23:04:36 +01:00
5 changed files with 303 additions and 97 deletions


@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications.
Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper compatible transcription API.
Whisper enables speech recognition for multiple languages and dialects:
english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
hausa, bashkir, javanese and sundanese.
## Supported platforms
## Local mode (offline)
This add-on uses some native binaries to work.
### Supported platforms
This add-on uses some native binaries to work when performing offline recognition.
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
The following platforms are supported:
@ -28,7 +32,7 @@ The following platforms are supported:
The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
## CPU compatibility
### CPU compatibility
To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU.
The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds.
@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`.
If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`.
You can check those flags on linux using the terminal with `lscpu`.
## Transcription time
### Transcription time
On a Raspberry PI 5, the approximate transcription times are:
| model | exec time |
| ---------- | --------: |
|------------|----------:|
| tiny.bin | 1.5s |
| base.bin | 3s |
| small.bin | 8.5s |
| medium.bin | 17s |
## Configuring the model
### Configuring the model
Before you can use this service you should configure your model.
@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so
Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link.
## Using alternative whisper.cpp library
### Using alternative whisper.cpp library
It's possible to use your own build of the whisper.cpp shared library with this add-on.
@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi
Note: You need to restart openHAB to reload the library.
## Grammar
### Grammar
The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model.
@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+
You can provide the grammar and enable its usage using the binding configuration.
## API mode
You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI).
You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server.
Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available.
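As an illustration of what an OpenAI-compatible transcription endpoint receives, here is a minimal sketch that builds (but does not send) the multipart request with the JDK's `java.net.http` classes. The form fields (`file`, `model`, `response_format`, `language`) and the `Bearer` authorization header mirror what this add-on sends; the class and method names (`TranscriptionRequestSketch`, `multipartBody`, `buildRequest`) are hypothetical, and the add-on itself uses the Jetty client rather than this code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the add-on: assembles a transcription request by hand.
final class TranscriptionRequestSketch {

    /** Builds a multipart/form-data body with the text fields followed by one WAV file part. */
    static byte[] multipartBody(String boundary, byte[] wav, String model, String language) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeField(out, boundary, "model", model);
        writeField(out, boundary, "response_format", "text");
        if (!language.isBlank()) {
            // the language field is optional; it is omitted when no language is configured
            writeField(out, boundary, "language", language);
        }
        String fileHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n"
                + "Content-Type: audio/wav\r\n\r\n";
        out.write(fileHeader.getBytes(StandardCharsets.UTF_8));
        out.write(wav);
        // closing delimiter of the multipart body
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    private static void writeField(ByteArrayOutputStream out, String boundary, String name, String value)
            throws IOException {
        String part = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + name + "\"\r\n\r\n"
                + value + "\r\n";
        out.write(part.getBytes(StandardCharsets.UTF_8));
    }

    /** Creates the POST request; the Authorization header is only added when a key is configured. */
    static HttpRequest buildRequest(String apiUrl, String apiKey, byte[] wav, String model, String language)
            throws IOException {
        String boundary = "----sketch" + System.nanoTime();
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(apiUrl))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(boundary, wav, model, language)));
        if (!apiKey.isBlank()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```

Passing the result of `buildRequest(...)` to `HttpClient.send(request, HttpResponse.BodyHandlers.ofString())` would return the transcription as plain text, assuming the service honors `response_format=text`.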
## Configuration
Use your favorite configuration UI to edit the Whisper settings:
@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings:
General options.
- **Mode : LOCAL or API** - Choose either local computation or remote API use.
- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
- **Preload Model** - Keep whisper model loaded.
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
@ -139,6 +152,13 @@ Configure whisper options.
- **Initial Prompt** - Initial prompt for whisper.
- **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
- **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
- **Language** - If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.
### API Configuration
- **API key** - Optional API key, for online services that require one.
- **API url** - You may use your own service and define its URL here. Defaults to the OpenAI transcription API.
- **API model name** - Your hosted service may offer other models. Defaults to OpenAI's only model, 'whisper-1'.
### Grammar Configuration
@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file
Its contents should look similar to:
```ini
org.openhab.voice.whisperstt:mode=LOCAL
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:language=en
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
org.openhab.voice.whisperstt:apiKey=mykeyaaaa
org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions
org.openhab.voice.whisperstt:apiModelName=whisper-1
```
### Default Speech-to-Text Configuration


@ -146,4 +146,29 @@ public class WhisperSTTConfiguration {
* Print whisper.cpp library logs as binding debug logs.
*/
public boolean enableWhisperLog;
/**
* LOCAL to use the embedded whisper library, or API to use an external OpenAI-compatible API
*/
public Mode mode = Mode.LOCAL;
/**
* If mode is set to API, use this URL
*/
public String apiUrl = "https://api.openai.com/v1/audio/transcriptions";
/**
* If mode is set to API, use this API key to access apiUrl
*/
public String apiKey = "";
/**
* If specified, speed up recognition by avoiding auto-detection
*/
public String language = "";
/**
* Model name (API only)
*/
public String apiModelName = "whisper-1";
public static enum Mode {
LOCAL,
API;
}
}


@ -12,12 +12,10 @@
*/
package org.openhab.voice.whisperstt.internal;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
@ -32,7 +30,9 @@ import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sound.sampled.AudioFileFormat;
@ -41,6 +41,13 @@ import javax.sound.sampled.AudioSystem;
import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.client.util.InputStreamContentProvider;
import org.eclipse.jetty.client.util.MultiPartContentProvider;
import org.eclipse.jetty.client.util.StringContentProvider;
import org.eclipse.jetty.http.HttpMethod;
import org.openhab.core.OpenHAB;
import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream;
@ -48,6 +55,7 @@ import org.openhab.core.audio.utils.AudioWaveUtils;
import org.openhab.core.common.ThreadPoolManager;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.Configuration;
import org.openhab.core.io.net.http.HttpClientFactory;
import org.openhab.core.io.rest.LocaleService;
import org.openhab.core.voice.RecognitionStartEvent;
import org.openhab.core.voice.RecognitionStopEvent;
@ -57,6 +65,7 @@ import org.openhab.core.voice.STTService;
import org.openhab.core.voice.STTServiceHandle;
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
import org.openhab.core.voice.SpeechRecognitionEvent;
import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode;
import org.openhab.voice.whisperstt.internal.utils.VAD;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Activate;
@ -96,10 +105,13 @@ public class WhisperSTTService implements STTService {
private @Nullable WhisperContext context;
private @Nullable WhisperGrammar grammar;
private @Nullable WhisperJNI whisper;
private boolean isWhisperLibAlreadyLoaded = false;
private final HttpClientFactory httpClientFactory;
@Activate
public WhisperSTTService(@Reference LocaleService localeService) {
public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) {
this.localeService = localeService;
this.httpClientFactory = httpClientFactory;
}
@Activate
@ -108,7 +120,8 @@ public class WhisperSTTService implements STTService {
if (!Files.exists(WHISPER_FOLDER)) {
Files.createDirectory(WHISPER_FOLDER);
}
WhisperJNI.loadLibrary(getLoadOptions());
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
loadWhisperLibraryIfNeeded();
VoiceActivityDetector.loadLibrary();
whisper = new WhisperJNI();
} catch (IOException | RuntimeException e) {
@ -117,6 +130,13 @@ public class WhisperSTTService implements STTService {
configChange(config);
}
private void loadWhisperLibraryIfNeeded() throws IOException {
if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) {
WhisperJNI.loadLibrary(getLoadOptions());
isWhisperLibAlreadyLoaded = true;
}
}
private WhisperJNI.LoadOptions getLoadOptions() {
Path libFolder = Paths.get("/usr/local/lib");
Path libFolderWin = Paths.get("/Windows/System32");
@ -167,14 +187,27 @@ public class WhisperSTTService implements STTService {
private void configChange(Map<String, Object> config) {
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
WhisperGrammar grammar = this.grammar;
if (grammar != null) {
grammar.close();
this.grammar = null;
}
// API mode
if (this.config.mode == Mode.API) {
try {
unloadContext();
} catch (IOException e) {
logger.warn("IOException unloading model: {}", e.getMessage());
}
return;
}
// Local mode
WhisperJNI whisper;
try {
loadWhisperLibraryIfNeeded();
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
whisper = getWhisper();
} catch (IOException ignored) {
logger.warn("library not loaded, the add-on will not work");
@ -228,9 +261,17 @@ public class WhisperSTTService implements STTService {
@Override
public Set<Locale> getSupportedLocales() {
// as it is not possible to determine the language of the model that was downloaded and setup by the user, it is
// assumed the language of the model is matching the locale of the openHAB server
return Set.of(localeService.getLocale(null));
// Attempt to create a locale from the configured language
String language = config.language;
Locale modelLocale = localeService.getLocale(null);
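if (!language.isBlank()) {
// Locale.forLanguageTag never throws on ill-formed input; it returns a locale with an empty language
Locale configuredLocale = Locale.forLanguageTag(language);
if (configuredLocale.getLanguage().isEmpty()) {
logger.warn("Invalid language '{}', defaulting to server locale", language);
} else {
modelLocale = configuredLocale;
}
}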
return Set.of(modelLocale);
}
@Override
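A note on the language lookup in `getSupportedLocales()`: `Locale.forLanguageTag` does not throw `IllegalArgumentException` for an ill-formed tag; it ignores the ill-formed part and may return a locale with an empty language. The sketch below shows one way to detect that case and fall back. The class and method names (`LanguageTagSketch`, `resolve`) are hypothetical and only illustrate the check.

```java
import java.util.Locale;

// Hypothetical helper, not part of the add-on: resolves a configured language tag with a fallback.
final class LanguageTagSketch {

    /** Returns the locale for the configured tag, or the fallback when the tag is blank or ill-formed. */
    static Locale resolve(String configuredLanguage, Locale fallback) {
        if (configuredLanguage.isBlank()) {
            return fallback;
        }
        Locale parsed = Locale.forLanguageTag(configuredLanguage.trim());
        // an ill-formed tag such as "en_US" (underscore instead of hyphen) yields an empty language
        return parsed.getLanguage().isEmpty() ? fallback : parsed;
    }
}
```

With this check, `resolve("pt-BR", fallback)` yields a Portuguese (Brazil) locale, while `resolve("en_US", fallback)` returns the fallback because the underscore makes the tag ill-formed.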
@ -246,33 +287,18 @@ public class WhisperSTTService implements STTService {
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
throws STTException {
AtomicBoolean aborted = new AtomicBoolean(false);
WhisperContext ctx = null;
WhisperState state = null;
try {
var whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
logger.debug("Creating VAD instance...");
final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE);
final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE);
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
config.vadStep, config.vadSensitivity);
logger.debug("VAD instance created");
sttListener.sttEventReceived(new RecognitionStartEvent());
backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted);
backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted);
} catch (IOException e) {
if (ctx != null && !config.preloadModel) {
ctx.close();
}
if (state != null) {
state.close();
}
throw new STTException("Exception during initialization", e);
}
return () -> {
aborted.set(true);
};
return () -> aborted.set(true);
}
private WhisperJNI getWhisper() throws IOException {
@ -339,9 +365,8 @@ public class WhisperSTTService implements STTService {
}
}
private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep,
Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
var releaseContext = !config.preloadModel;
private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener,
AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
@ -353,21 +378,17 @@ public class WhisperSTTService implements STTService {
logger.debug("Max silence samples {}", nMaxSilenceSamples);
// used to store the step samples in libfvad wanted format 16-bit int
final short[] stepAudioSamples = new short[nSamplesStep];
// used to store the full samples in whisper wanted format 32-bit float
final float[] audioSamples = new float[nSamplesMax];
// used to store the full retained samples for whisper
final short[] audioSamples = new short[nSamplesMax];
executor.submit(() -> {
int audioSamplesOffset = 0;
int silenceSamplesCounter = 0;
int nProcessedSamples = 0;
int numBytesRead;
boolean voiceDetected = false;
String transcription = "";
String tempTranscription = "";
VAD.@Nullable VADResult lastVADResult;
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
try {
try (state; //
audioStream; //
try (audioStream; //
vad) {
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
AudioWaveUtils.removeFMT(audioStream);
@ -376,10 +397,9 @@ public class WhisperSTTService implements STTService {
.order(ByteOrder.LITTLE_ENDIAN);
// init remaining to full capacity
int remaining = captureBuffer.capacity();
WhisperFullParams params = getWhisperFullParams(ctx, locale);
while (!aborted.get()) {
// read until no remaining so we get the complete step samples
numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
remaining);
if (aborted.get() || numBytesRead == -1) {
break;
@ -395,17 +415,15 @@ public class WhisperSTTService implements STTService {
while (shortBuffer.hasRemaining()) {
var position = shortBuffer.position();
short i16BitSample = shortBuffer.get();
float f32BitSample = Float.min(1f,
Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
stepAudioSamples[position] = i16BitSample;
audioSamples[audioSamplesOffset++] = f32BitSample;
audioSamples[audioSamplesOffset++] = i16BitSample;
nProcessedSamples++;
}
// run vad
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
logger.debug("VAD: Skipping, max length reached");
} else {
lastVADResult = vad.analyze(stepAudioSamples);
VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples);
if (lastVADResult.isVoice()) {
voiceDetected = true;
logger.debug("VAD: voice detected");
@ -484,43 +502,26 @@ public class WhisperSTTService implements STTService {
}
}
}
// run whisper
logger.debug("running whisper with {} seconds of audio...",
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
long execStartTime = System.currentTimeMillis();
var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset);
logger.debug("whisper ended in {}ms with result code {}",
System.currentTimeMillis() - execStartTime, result);
// process result
if (result != 0) {
emitSpeechRecognitionError(sttListener);
break;
}
int nSegments = whisper.fullNSegmentsFromState(state);
logger.debug("Available transcription segments {}", nSegments);
if (nSegments == 1) {
tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
// run whisper, either locally or by remote API
String tempTranscription = (switch (config.mode) {
case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage());
case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage());
});
if (tempTranscription != null && !tempTranscription.isBlank()) {
if (config.createWAVRecord) {
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
locale.getLanguage());
}
transcription += tempTranscription;
if (config.singleUtteranceMode) {
logger.debug("single utterance mode, ending transcription");
transcription = tempTranscription;
break;
}
} else {
// start a new transcription segment
transcription += tempTranscription;
tempTranscription = "";
}
} else if (nSegments == 0 && config.singleUtteranceMode) {
logger.debug("Single utterance mode and no results, ending transcription");
break;
} else if (nSegments > 1) {
// non reachable
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
break;
}
// reset state to start with next segment
voiceDetected = false;
silenceSamplesCounter = 0;
@ -528,10 +529,6 @@ public class WhisperSTTService implements STTService {
logger.debug("Partial transcription: {}", tempTranscription);
logger.debug("Transcription: {}", transcription);
}
} finally {
if (releaseContext) {
ctx.close();
}
}
// emit result
if (!aborted.get()) {
@ -543,7 +540,7 @@ public class WhisperSTTService implements STTService {
emitSpeechRecognitionNoResultsError(sttListener);
}
}
} catch (IOException e) {
} catch (STTException | IOException e) {
logger.warn("Error running speech to text: {}", e.getMessage());
emitSpeechRecognitionError(sttListener);
} catch (UnsatisfiedLinkError e) {
@ -553,7 +550,119 @@ public class WhisperSTTService implements STTService {
});
}
private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException {
@Nullable
private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException {
logger.debug("running whisper with {} seconds of audio...",
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
var releaseContext = !config.preloadModel;
WhisperJNI whisper = null;
WhisperContext ctx = null;
WhisperState state = null;
try {
whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
WhisperFullParams params = getWhisperFullParams(ctx, language);
// convert to local whisper format (float)
float[] floatArray = new float[audioSamples.length];
for (int i = 0; i < audioSamples.length; i++) {
floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f));
}
long execStartTime = System.currentTimeMillis();
var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset);
logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime,
result);
// process result
if (result != 0) {
throw new STTException("Cannot use whisper locally, result code: " + result);
}
int nSegments = whisper.fullNSegmentsFromState(state);
logger.debug("Available transcription segments {}", nSegments);
if (nSegments == 1) {
return whisper.fullGetSegmentTextFromState(state, 0);
} else if (nSegments == 0 && config.singleUtteranceMode) {
logger.debug("Single utterance mode and no results, ending transcription");
return null;
} else {
// non reachable
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
return null;
}
} catch (IOException e) {
throw new STTException("Cannot use whisper locally", e);
} finally {
// always release the native state; release the context only when the model is not preloaded
if (state != null) {
state.close();
}
if (releaseContext && ctx != null) {
ctx.close();
}
}
}
private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
// convert to byte array, Each short has 2 bytes
int size = audioSamplesOffset * 2;
ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < audioSamplesOffset; i++) {
byteArrayBuffer.putShort(audioStream[i]);
}
javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat(
javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE,
false);
byte[] byteArray = byteArrayBuffer.array();
try {
AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat,
audioSamplesOffset);
// write stream as a WAV file, in a byte array stream :
ByteArrayInputStream byteArrayInputStream = null;
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos);
byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray());
}
// prepare HTTP request
HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient();
MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider();
multiPartContentProvider.addFilePart("file", "audio.wav",
new InputStreamContentProvider(byteArrayInputStream), null);
multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null);
multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null);
multiPartContentProvider.addFieldPart("temperature",
new StringContentProvider(Float.toString(this.config.temperature)), null);
if (!language.isBlank()) {
multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null);
}
Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST)
.content(multiPartContentProvider);
if (!config.apiKey.isBlank()) {
request = request.header("Authorization", "Bearer " + config.apiKey);
}
// execute the request
ContentResponse response = request.send();
// check the HTTP status code from the response
int statusCode = response.getStatus();
if (statusCode < 200 || statusCode >= 300) {
logger.debug("HTTP error: Received status code {}, full error is {}", statusCode,
response.getContentAsString());
throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode);
}
return response.getContentAsString();
} catch (InterruptedException | TimeoutException | ExecutionException | IOException e) {
throw new STTException("Exception during attempt to get speech recognition result from api", e);
}
}
private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException {
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
var params = new WhisperFullParams(strategy);
params.temperature = config.temperature;
@ -570,7 +679,7 @@ public class WhisperSTTService implements STTService {
params.grammarPenalty = config.grammarPenalty;
}
// there is no single language models other than the english ones
params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
params.language = getWhisper().isMultilingual(context) ? language : "en";
// implementation assumes this options
params.translate = false;
params.detectLanguage = false;
@ -605,7 +714,7 @@ public class WhisperSTTService implements STTService {
}
}
private void createAudioFile(float[] samples, int size, String transcription, String language) {
private void createAudioFile(short[] samples, int size, String transcription, String language) {
createSamplesDir();
javax.sound.sampled.AudioFormat jAudioFormat;
ByteBuffer byteBuffer;
@ -615,7 +724,7 @@ public class WhisperSTTService implements STTService {
WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < size; i++) {
byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
byteBuffer.putShort(samples[i]);
}
} else {
logger.debug("Saving audio file with sample format f32");
@ -623,7 +732,7 @@ public class WhisperSTTService implements STTService {
WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < size; i++) {
byteBuffer.putFloat(samples[i]);
byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f)));
}
}
AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
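The change above keeps the retained audio as 16-bit samples and converts it only where needed: to normalized floats for local whisper, or to a WAV container for the API and for recorded files. A minimal, self-contained sketch of both conversions follows. `PcmSketch` and its methods are hypothetical names, and the 16000 Hz sample rate is an assumption standing in for `WHISPER_SAMPLE_RATE`, whose value is not shown in this diff.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper, not part of the add-on: sample format conversions used around whisper.
final class PcmSketch {
    // assumption: 16 kHz mono audio, standing in for WHISPER_SAMPLE_RATE
    static final int SAMPLE_RATE = 16000;

    /** Normalizes a signed 16-bit sample to the range [-1, 1], clamping Short.MIN_VALUE to -1. */
    static float toFloat(short sample) {
        return Float.min(1f, Float.max((float) sample / (float) Short.MAX_VALUE, -1f));
    }

    /** Wraps the first {@code count} samples in an in-memory WAV file (little-endian PCM, mono). */
    static byte[] toWav(short[] samples, int count) throws IOException {
        ByteBuffer pcm = ByteBuffer.allocate(count * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < count; i++) {
            pcm.putShort(samples[i]);
        }
        AudioFormat format = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED, SAMPLE_RATE, 16, 1, 2, SAMPLE_RATE,
                false);
        // the frame count is passed explicitly so the WAV header can be written in one pass
        try (AudioInputStream in = new AudioInputStream(new ByteArrayInputStream(pcm.array()), format, count);
                ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
            return out.toByteArray();
        }
    }
}
```

Converting only `count` samples, rather than the whole buffer, avoids processing the unused tail of a buffer sized for the maximum recording length.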


@ -11,7 +11,7 @@
</parameter-group>
<parameter-group name="vad">
<label>Voice Activity Detection</label>
<description>Configure the VAD mechanisim used to isolate single phrases to feed whisper with.</description>
<description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
</parameter-group>
<parameter-group name="whisper">
<label>Whisper Options</label>
@ -19,7 +19,7 @@
</parameter-group>
<parameter-group name="grammar">
<label>Grammar</label>
<description>Define a grammar to improve transcrptions.</description>
<description>Define a grammar to improve transcriptions.</description>
</parameter-group>
<parameter-group name="messages">
<label>Info Messages</label>
@ -30,9 +30,27 @@
<description>Options added for developers.</description>
<advanced>true</advanced>
</parameter-group>
<parameter-group name="openaiapi">
<label>API Configuration Options</label>
<description>Configure OpenAI compatible API, if you don't want to use the local model.</description>
</parameter-group>
<parameter name="mode" type="text" groupName="stt">
<label>Local Mode Or API</label>
<description>Use the local model or the OpenAI compatible API.</description>
<default>LOCAL</default>
<options>
<option value="LOCAL">Local</option>
<option value="API">OpenAI API</option>
</options>
</parameter>
<parameter name="modelName" type="text" groupName="stt" required="true">
<label>Model Name</label>
<description>Model name without extension.</description>
<label>Local Model Name</label>
<description>Model name without extension. Local mode only.</description>
</parameter>
<parameter name="language" type="text" groupName="whisper">
<label>Language</label>
<description>If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.</description>
<default></default>
</parameter>
<parameter name="preloadModel" type="boolean" groupName="stt">
<label>Preload Model</label>
@ -225,5 +243,20 @@
<default>false</default>
<advanced>true</advanced>
</parameter>
<parameter name="apiKey" type="text" groupName="openaiapi">
<label>API Key</label>
<description>Key to access the API</description>
<default></default>
</parameter>
<parameter name="apiUrl" type="text" groupName="openaiapi">
<label>API Url</label>
<description>OpenAI compatible API URL. Defaults to the OpenAI transcription service.</description>
<default>https://api.openai.com/v1/audio/transcriptions</default>
</parameter>
<parameter name="apiModelName" type="text" groupName="openaiapi">
<label>API Model</label>
<description>Model name to use (API only). Defaults to OpenAI's only available model (whisper-1).</description>
<default>whisper-1</default>
</parameter>
</config-description>
</config-description:config-descriptions>


@ -3,6 +3,12 @@
addon.whisperstt.name = Whisper Speech-to-Text
addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text.
voice.config.whisperstt.apiKey.label = API Key
voice.config.whisperstt.apiKey.description = Key to access the API
voice.config.whisperstt.apiModelName.label = API Model
voice.config.whisperstt.apiModelName.description = Model name to use (API only). Defaults to OpenAI's only available model (whisper-1).
voice.config.whisperstt.apiUrl.label = API Url
voice.config.whisperstt.apiUrl.description = OpenAI compatible API URL. Defaults to the OpenAI transcription service.
voice.config.whisperstt.audioContext.label = Audio Context
voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
voice.config.whisperstt.beamSize.label = Beam Size
@ -24,27 +30,35 @@ voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sam
voice.config.whisperstt.group.developer.label = Developer
voice.config.whisperstt.group.developer.description = Options added for developers.
voice.config.whisperstt.group.grammar.label = Grammar
voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcrptions.
voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions.
voice.config.whisperstt.group.messages.label = Info Messages
voice.config.whisperstt.group.messages.description = Configure service information messages.
voice.config.whisperstt.group.openaiapi.label = API Configuration Options
voice.config.whisperstt.group.openaiapi.description = Configure OpenAI compatible API, if you don't want to use the local model.
voice.config.whisperstt.group.stt.label = STT Configuration
voice.config.whisperstt.group.stt.description = Configure Speech to Text.
voice.config.whisperstt.group.vad.label = Voice Activity Detection
voice.config.whisperstt.group.vad.description = Configure the VAD mechanisim used to isolate single phrases to feed whisper with.
voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with.
voice.config.whisperstt.group.whisper.label = Whisper Options
voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
voice.config.whisperstt.initialPrompt.label = Initial Prompt
voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
voice.config.whisperstt.language.label = Language
voice.config.whisperstt.language.description = If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.
voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
voice.config.whisperstt.modelName.label = Model Name
voice.config.whisperstt.modelName.description = Model name without extension.
voice.config.whisperstt.mode.label = Local Mode Or API
voice.config.whisperstt.mode.description = Use the local model or the OpenAI compatible API.
voice.config.whisperstt.mode.option.LOCAL = Local
voice.config.whisperstt.mode.option.API = OpenAI API
voice.config.whisperstt.modelName.label = Local Model Name
voice.config.whisperstt.modelName.description = Model name without extension. Local mode only.
voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
voice.config.whisperstt.preloadModel.label = Preload Model