[whisper] Add OpenAI API compatibility

Signed-off-by: Gwendal Roulleau <gwendal.roulleau@gmail.com>
Gwendal Roulleau 2024-12-17 17:17:48 +01:00
parent 2f7b727d14
commit e40473594a
4 changed files with 287 additions and 93 deletions


@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications.
Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper compatible transcription API.
Whisper enables speech recognition for multiple languages and dialects:
english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
hausa, bashkir, javanese and sundanese.
## Local mode (offline)

### Supported platforms
This add-on uses some native binaries to work when performing offline recognition.
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
The following platforms are supported:
@ -28,7 +32,7 @@ The following platforms are supported:
The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
### CPU compatibility
To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU.
The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds.
@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`.
If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`.
You can check those flags on linux using the terminal with `lscpu`.
### Transcription time
On a Raspberry PI 5, the approximate transcription times are:
| model      | exec time |
|------------|----------:|
| tiny.bin   |      1.5s |
| base.bin   |        3s |
| small.bin  |      8.5s |
| medium.bin |       17s |
### Configuring the model
Before you can use this service you should configure your model.
@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so
Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link.
### Using alternative whisper.cpp library
It's possible to use your own build of the whisper.cpp shared library with this add-on.
@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi
Note: You need to restart openHAB to reload the library.
### Grammar
The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model.
@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+
You can provide the grammar and enable its usage using the binding configuration.
## API mode

You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI).
You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server.
Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available.
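For reference, the request is a plain multipart upload, so any HTTP client can drive such an endpoint. The sketch below is only an illustration and is not part of the add-on: the class name, file path and self-hosted URL are made up, while the `file`, `model` and `response_format` fields follow the OpenAI transcription API. It uses Jetty's `HttpClient`, the same client library the add-on uses internally.

```java
import java.nio.file.Path;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.client.util.MultiPartContentProvider;
import org.eclipse.jetty.client.util.PathContentProvider;
import org.eclipse.jetty.client.util.StringContentProvider;
import org.eclipse.jetty.http.HttpMethod;

public class TranscriptionApiSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        client.start();
        try {
            // multipart body expected by an OpenAI-compatible transcription endpoint
            MultiPartContentProvider multipart = new MultiPartContentProvider();
            multipart.addFilePart("file", "audio.wav", new PathContentProvider(Path.of("/tmp/audio.wav")), null);
            multipart.addFieldPart("model", new StringContentProvider("whisper-1"), null);
            // "text" makes the endpoint answer with the raw transcription instead of JSON
            multipart.addFieldPart("response_format", new StringContentProvider("text"), null);
            multipart.close();

            // hypothetical self-hosted server (e.g. faster-whisper-server); for a hosted service such as
            // OpenAI you would also add: .header("Authorization", "Bearer " + apiKey)
            ContentResponse response = client.newRequest("http://localhost:8000/v1/audio/transcriptions")
                    .method(HttpMethod.POST).content(multipart).send();
            System.out.println(response.getStatus() + ": " + response.getContentAsString());
        } finally {
            client.stop();
        }
    }
}
```

In the add-on itself, the audio is captured from openHAB's audio source, segmented with libfvad, and only the retained samples are wrapped into a WAV stream before being sent, as the service code later in this commit shows.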
## Configuration

Use your favorite configuration UI to edit the Whisper settings:
@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings:
General options.
- **Mode: LOCAL or API** - Choose either local computation or remote API use.
- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
- **Preload Model** - Keep whisper model loaded.
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
@ -139,6 +152,13 @@ Configure whisper options.
- **Initial Prompt** - Initial prompt for whisper.
- **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
- **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
- **Language** - If specified, speeds up recognition by avoiding language auto-detection. Defaults to the system locale.

### API Configuration

- **API Key** - Optional API key for online services that require one.
- **API URL** - You may use your own service and define its URL here. Defaults to the OpenAI transcription API.
- **API Model Name** - Your hosted service may offer other models. Defaults to 'whisper-1', the only model available from OpenAI.
### Grammar Configuration
@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file
Its contents should look similar to:

```ini
org.openhab.voice.whisperstt:mode=LOCAL
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:language=en
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
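# the api* settings below are only used when mode=API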
org.openhab.voice.whisperstt:apiKey=mykeyaaaa
org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions
org.openhab.voice.whisperstt:apiModelName=whisper-1
```
### Default Speech-to-Text Configuration


@ -146,4 +146,29 @@ public class WhisperSTTConfiguration {
* Print whisper.cpp library logs as binding debug logs.
*/
public boolean enableWhisperLog;
/**
* LOCAL to use embedded whisper, or API to use an external OpenAI-compatible API
*/
public Mode mode = Mode.LOCAL;
/**
* If mode is set to API, use this URL
*/
public String apiUrl = "https://api.openai.com/v1/audio/transcriptions";
/**
* If mode is set to API, use this API key to access apiUrl
*/
public String apiKey = "";
/**
* If specified, speeds up recognition by avoiding language auto-detection
*/
public String language = "";
/**
* Model name (API only)
*/
public String apiModelName = "whisper-1";
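/**
* Recognition mode: LOCAL runs whisper.cpp in-process, API sends the audio to an OpenAI-compatible endpoint
*/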
public static enum Mode {
LOCAL,
API;
}
}


@ -12,12 +12,10 @@
*/
package org.openhab.voice.whisperstt.internal;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
@ -32,7 +30,9 @@ import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sound.sampled.AudioFileFormat;
@ -41,6 +41,13 @@ import javax.sound.sampled.AudioSystem;
import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.client.util.InputStreamContentProvider;
import org.eclipse.jetty.client.util.MultiPartContentProvider;
import org.eclipse.jetty.client.util.StringContentProvider;
import org.eclipse.jetty.http.HttpMethod;
import org.openhab.core.OpenHAB;
import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream;
@ -48,6 +55,7 @@ import org.openhab.core.audio.utils.AudioWaveUtils;
import org.openhab.core.common.ThreadPoolManager;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.Configuration;
import org.openhab.core.io.net.http.HttpClientFactory;
import org.openhab.core.io.rest.LocaleService;
import org.openhab.core.voice.RecognitionStartEvent;
import org.openhab.core.voice.RecognitionStopEvent;
@ -57,6 +65,7 @@ import org.openhab.core.voice.STTService;
import org.openhab.core.voice.STTServiceHandle;
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
import org.openhab.core.voice.SpeechRecognitionEvent;
import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode;
import org.openhab.voice.whisperstt.internal.utils.VAD;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Activate;
@ -96,10 +105,13 @@ public class WhisperSTTService implements STTService {
private @Nullable WhisperContext context;
private @Nullable WhisperGrammar grammar;
private @Nullable WhisperJNI whisper;
private boolean isWhisperLibAlreadyLoaded = false;
private final HttpClientFactory httpClientFactory;
@Activate
public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) {
this.localeService = localeService;
this.httpClientFactory = httpClientFactory;
}
@Activate
@ -108,7 +120,8 @@ public class WhisperSTTService implements STTService {
if (!Files.exists(WHISPER_FOLDER)) {
Files.createDirectory(WHISPER_FOLDER);
}
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
loadWhisperLibraryIfNeeded();
VoiceActivityDetector.loadLibrary();
whisper = new WhisperJNI();
} catch (IOException | RuntimeException e) {
@ -117,6 +130,13 @@ public class WhisperSTTService implements STTService {
configChange(config);
}
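// the native whisper.cpp library is only needed for local (offline) transcription; API mode skips loading it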
private void loadWhisperLibraryIfNeeded() throws IOException {
if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) {
WhisperJNI.loadLibrary(getLoadOptions());
isWhisperLibAlreadyLoaded = true;
}
}
private WhisperJNI.LoadOptions getLoadOptions() {
Path libFolder = Paths.get("/usr/local/lib");
Path libFolderWin = Paths.get("/Windows/System32");
@ -167,14 +187,27 @@ public class WhisperSTTService implements STTService {
private void configChange(Map<String, Object> config) {
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
WhisperGrammar grammar = this.grammar;
if (grammar != null) {
grammar.close();
this.grammar = null;
}
// API mode
if (this.config.mode == Mode.API) {
try {
unloadContext();
} catch (IOException e) {
logger.warn("IOException unloading model: {}", e.getMessage());
}
return;
}
// Local mode
WhisperJNI whisper;
try {
loadWhisperLibraryIfNeeded();
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
whisper = getWhisper();
} catch (IOException ignored) {
logger.warn("library not loaded, the add-on will not work");
@ -228,9 +261,17 @@ public class WhisperSTTService implements STTService {
@Override
public Set<Locale> getSupportedLocales() {
// Attempt to create a locale from the configured language
String language = config.language;
Locale modelLocale = localeService.getLocale(null);
if (!language.isBlank()) {
try {
modelLocale = Locale.forLanguageTag(language);
} catch (IllegalArgumentException e) {
logger.warn("Invalid language '{}', defaulting to server locale", language);
}
}
return Set.of(modelLocale);
}
@Override
@ -245,34 +286,20 @@ public class WhisperSTTService implements STTService {
@Override
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
throws STTException {
AtomicBoolean aborted = new AtomicBoolean(false);
WhisperContext ctx = null;
WhisperState state = null;
try {
var whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
logger.debug("Creating VAD instance..."); logger.debug("Creating VAD instance...");
final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE); final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE);
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep, VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
config.vadStep, config.vadSensitivity); config.vadStep, config.vadSensitivity);
logger.debug("VAD instance created"); logger.debug("VAD instance created");
sttListener.sttEventReceived(new RecognitionStartEvent()); sttListener.sttEventReceived(new RecognitionStartEvent());
backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted); backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted);
} catch (IOException e) { } catch (IOException e) {
if (ctx != null && !config.preloadModel) {
ctx.close();
}
if (state != null) {
state.close();
}
throw new STTException("Exception during initialization", e);
}
return () -> aborted.set(true);
}
private WhisperJNI getWhisper() throws IOException {
@ -339,9 +366,8 @@ public class WhisperSTTService implements STTService {
}
}
private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener,
AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
var releaseContext = !config.preloadModel;
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
@ -353,21 +379,17 @@ public class WhisperSTTService implements STTService {
logger.debug("Max silence samples {}", nMaxSilenceSamples); logger.debug("Max silence samples {}", nMaxSilenceSamples);
// used to store the step samples in libfvad wanted format 16-bit int // used to store the step samples in libfvad wanted format 16-bit int
final short[] stepAudioSamples = new short[nSamplesStep]; final short[] stepAudioSamples = new short[nSamplesStep];
// used to store the full samples in whisper wanted format 32-bit float // used to store the full retained samples for whisper
final float[] audioSamples = new float[nSamplesMax]; final short[] audioSamples = new short[nSamplesMax];
executor.submit(() -> { executor.submit(() -> {
int audioSamplesOffset = 0; int audioSamplesOffset = 0;
int silenceSamplesCounter = 0; int silenceSamplesCounter = 0;
int nProcessedSamples = 0; int nProcessedSamples = 0;
int numBytesRead;
boolean voiceDetected = false;
String transcription = "";
String tempTranscription = "";
VAD.@Nullable VADResult lastVADResult;
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
try {
try (audioStream; //
vad) {
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
AudioWaveUtils.removeFMT(audioStream);
@ -376,10 +398,9 @@ public class WhisperSTTService implements STTService {
.order(ByteOrder.LITTLE_ENDIAN);
// init remaining to full capacity
int remaining = captureBuffer.capacity();
WhisperFullParams params = getWhisperFullParams(ctx, locale);
while (!aborted.get()) {
// read until no remaining so we get the complete step samples
int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
remaining);
if (aborted.get() || numBytesRead == -1) {
break;
@ -395,17 +416,15 @@ public class WhisperSTTService implements STTService {
while (shortBuffer.hasRemaining()) {
var position = shortBuffer.position();
short i16BitSample = shortBuffer.get();
float f32BitSample = Float.min(1f,
Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
stepAudioSamples[position] = i16BitSample;
audioSamples[audioSamplesOffset++] = i16BitSample;
nProcessedSamples++;
}
// run vad
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
logger.debug("VAD: Skipping, max length reached");
} else {
VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples);
if (lastVADResult.isVoice()) {
voiceDetected = true;
logger.debug("VAD: voice detected");
@ -484,43 +503,26 @@ public class WhisperSTTService implements STTService {
}
}
}
// run whisper, either locally or by remote API
String tempTranscription = (switch (config.mode) {
case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage());
case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage());
});
if (tempTranscription != null && !tempTranscription.isBlank()) {
if (config.createWAVRecord) {
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
locale.getLanguage());
}
transcription += tempTranscription;
if (config.singleUtteranceMode) {
logger.debug("single utterance mode, ending transcription");
break;
}
} else {
break;
}
// reset state to start with next segment
voiceDetected = false;
silenceSamplesCounter = 0;
@ -528,10 +530,6 @@ public class WhisperSTTService implements STTService {
logger.debug("Partial transcription: {}", tempTranscription); logger.debug("Partial transcription: {}", tempTranscription);
logger.debug("Transcription: {}", transcription); logger.debug("Transcription: {}", transcription);
} }
} finally {
if (releaseContext) {
ctx.close();
}
}
// emit result
if (!aborted.get()) {
@ -543,7 +541,7 @@ public class WhisperSTTService implements STTService {
emitSpeechRecognitionNoResultsError(sttListener);
}
}
} catch (STTException | IOException e) {
logger.warn("Error running speech to text: {}", e.getMessage());
emitSpeechRecognitionError(sttListener);
} catch (UnsatisfiedLinkError e) {
@ -553,7 +551,120 @@ public class WhisperSTTService implements STTService {
});
}
@Nullable
private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException {
logger.debug("running whisper with {} seconds of audio...",
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
var releaseContext = !config.preloadModel;
WhisperJNI whisper = null;
WhisperContext ctx = null;
WhisperState state = null;
try {
whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
WhisperFullParams params = getWhisperFullParams(ctx, language);
// convert to local whisper format (float)
float[] floatArray = new float[audioSamples.length];
for (int i = 0; i < audioSamples.length; i++) {
floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f));
}
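// whisper.cpp's full API expects 32-bit float PCM in the [-1, 1] range, hence the scaling above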
long execStartTime = System.currentTimeMillis();
var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset);
logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime,
result);
// process result
if (result != 0) {
throw new STTException("Cannot use whisper locally, result code: " + result);
}
int nSegments = whisper.fullNSegmentsFromState(state);
logger.debug("Available transcription segments {}", nSegments);
if (nSegments == 1) {
return whisper.fullGetSegmentTextFromState(state, 0);
} else if (nSegments == 0 && config.singleUtteranceMode) {
logger.debug("Single utterance mode and no results, ending transcription");
return null;
} else {
// non reachable
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
return null;
}
} catch (IOException e) {
if (state != null) {
state.close();
}
throw new STTException("Cannot use whisper locally", e);
} finally {
if (releaseContext && ctx != null) {
ctx.close();
}
}
}
private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
// convert to byte array, Each short has 2 bytes
int size = audioSamplesOffset * 2;
ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < audioSamplesOffset; i++) {
byteArrayBuffer.putShort(audioStream[i]);
}
javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat(
javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE,
false);
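// i.e. mono, 16-bit, little-endian PCM at the whisper sample rate (the format of the retained samples)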
byte[] byteArray = byteArrayBuffer.array();
try {
AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat,
size);
// write stream as a WAV file, in a byte array stream :
ByteArrayInputStream byteArrayInputStream = null;
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos);
byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray());
}
// prepare HTTP request
HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient();
MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider();
multiPartContentProvider.addFilePart("file", "audio.wav",
new InputStreamContentProvider(byteArrayInputStream), null);
multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null);
multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null);
multiPartContentProvider.addFieldPart("temperature",
new StringContentProvider(Float.toString(this.config.temperature)), null);
if (!language.isBlank()) {
multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null);
}
Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST)
.content(multiPartContentProvider);
if (!config.apiKey.isBlank()) {
request = request.header("Authorization", "Bearer " + config.apiKey);
}
// execute the request
ContentResponse response = request.send();
// check the HTTP status code from the response
int statusCode = response.getStatus();
if (statusCode < 200 || statusCode >= 300) {
logger.debug("HTTP error: Received status code {}, full error is {}", statusCode,
response.getContentAsString());
throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode);
}
return response.getContentAsString();
} catch (InterruptedException | TimeoutException | ExecutionException | IOException e) {
throw new STTException("Exception during attempt to get speech recognition result from api", e);
}
}
private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException {
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
var params = new WhisperFullParams(strategy);
params.temperature = config.temperature;
@ -570,7 +681,7 @@ public class WhisperSTTService implements STTService {
params.grammarPenalty = config.grammarPenalty;
}
// there is no single language models other than the english ones
params.language = getWhisper().isMultilingual(context) ? language : "en";
// implementation assumes this options
params.translate = false;
params.detectLanguage = false;
@ -605,7 +716,7 @@ public class WhisperSTTService implements STTService {
}
}
private void createAudioFile(short[] samples, int size, String transcription, String language) {
createSamplesDir();
javax.sound.sampled.AudioFormat jAudioFormat;
ByteBuffer byteBuffer;
@ -615,7 +726,7 @@ public class WhisperSTTService implements STTService {
WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < size; i++) {
byteBuffer.putShort(samples[i]);
}
} else {
logger.debug("Saving audio file with sample format f32");
@ -623,7 +734,7 @@ public class WhisperSTTService implements STTService {
WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < size; i++) {
byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f)));
}
}
AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),


@ -11,7 +11,7 @@
</parameter-group>
<parameter-group name="vad">
<label>Voice Activity Detection</label>
<description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
</parameter-group>
<parameter-group name="whisper">
<label>Whisper Options</label>
@ -19,7 +19,7 @@
</parameter-group>
<parameter-group name="grammar">
<label>Grammar</label>
<description>Define a grammar to improve transcriptions.</description>
</parameter-group>
<parameter-group name="messages">
<label>Info Messages</label>
@ -30,9 +30,27 @@
<description>Options added for developers.</description>
<advanced>true</advanced>
</parameter-group>
<parameter-group name="openaiapi">
<label>API Configuration Options</label>
<description>Configure the OpenAI-compatible API, if you don't want to use the local model.</description>
</parameter-group>
<parameter name="mode" type="text" groupName="stt">
<label>Local Mode Or API</label>
<description>Use the local model or the OpenAI compatible API.</description>
<default>LOCAL</default>
<options>
<option value="LOCAL">Local</option>
<option value="API">OpenAI API</option>
</options>
</parameter>
<parameter name="modelName" type="text" groupName="stt" required="true"> <parameter name="modelName" type="text" groupName="stt" required="true">
<label>Model Name</label> <label>Local Model Name</label>
<description>Model name without extension.</description> <description>Model name without extension. Local mode only.</description>
</parameter>
<parameter name="language" type="text" groupName="whisper">
<label>Language</label>
<description>If specified, speeds up recognition by avoiding language auto-detection. Defaults to the system locale.</description>
<default></default>
</parameter>
<parameter name="preloadModel" type="boolean" groupName="stt">
<label>Preload Model</label>
@ -225,5 +243,20 @@
<default>false</default>
<advanced>true</advanced>
</parameter>
<parameter name="apiKey" type="text" groupName="openaiapi">
<label>API Key</label>
<description>Key to access the API</description>
<default></default>
</parameter>
<parameter name="apiUrl" type="text" groupName="openaiapi">
<label>API Url</label>
<description>OpenAI-compatible API URL. Defaults to the OpenAI transcription service.</description>
<default>https://api.openai.com/v1/audio/transcriptions</default>
</parameter>
<parameter name="apiModelName" type="text" groupName="openaiapi">
<label>API Model</label>
<description>Model name to use (API only). Defaults to the only model available from OpenAI (whisper-1).</description>
<default>whisper-1</default>
</parameter>
</config-description>
</config-description:config-descriptions>