[whisper] Add OpenAI API compatibility
Signed-off-by: Gwendal Roulleau <gwendal.roulleau@gmail.com>
parent 2f7b727d14
commit e40473594a
@@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det
 [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that makes it easy to integrate into different platforms and applications.
 
+Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper-compatible transcription API.
+
 Whisper enables speech recognition for multiple languages and dialects:
 
 english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
@@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo
 uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
 hausa, bashkir, javanese and sundanese.
 
-## Supported platforms
+## Local mode (offline)
+
+### Supported platforms
 
-This add-on uses some native binaries to work.
+This add-on uses some native binaries to work when performing offline recognition.
 It uses the [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and the [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
 
 The following platforms are supported:
@@ -28,7 +32,7 @@ The following platforms are supported:
 
 The native binaries for those platforms are included in the add-on provided with the openHAB distribution.
 
-## CPU compatibility
+### CPU compatibility
 
 To use this binding, it is recommended to use a device at least as powerful as a Raspberry Pi 5 with a modern CPU.
 The execution times on a Raspberry Pi 4 are about twice as long, so only the tiny model runs in under 5 seconds there.
@@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`.
 If you are going to use the binding on an `arm64` host, the CPU should support the `fphp` flag.
 You can check those flags on Linux using the terminal with `lscpu`.
 
-## Transcription time
+### Transcription time
 
 On a Raspberry Pi 5, the approximate transcription times are:
 
 | model      | exec time |
-| ---------- | --------: |
+|------------|----------:|
 | tiny.bin   |      1.5s |
 | base.bin   |        3s |
 | small.bin  |      8.5s |
 | medium.bin |       17s |
 
-## Configuring the model
+### Configuring the model
 
 Before you can use this service, you should configure your model.
@@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so
 
 Remember to check that you have enough RAM to load the model; the estimated RAM consumption can be checked on the Hugging Face link.
 
-## Using alternative whisper.cpp library
+### Using alternative whisper.cpp library
 
 It's possible to use your own build of the whisper.cpp shared library with this add-on.
 
@@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi
 
 Note: You need to restart openHAB to reload the library.
 
-## Grammar
+### Grammar
 
 The whisper.cpp library allows you to define a grammar to alter the transcription results without fine-tuning the model.
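For illustration, a grammar in the BNF-like syntax accepted by whisper.cpp might look like the following sketch (only the `tv_channel` rule is taken from the add-on's documented example; the `root` and `light_switch` rules here are hypothetical):

```
root ::= " " command "."
command ::= tv_channel | light_switch
tv_channel ::= ("set ")? "tv channel to " [0-9]+
light_switch ::= "turn the light " ("on" | "off")
```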
@@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+
 
 You can provide the grammar and enable its usage using the binding configuration.
 
+## API mode
+
+You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI).
+
+You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server.
+
+Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available.
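For example, a minimal text-file configuration for API mode might look like the following sketch (the URL points at a hypothetical self-hosted faster-whisper-server instance; the exact property keys are listed under Configuration below):

```ini
org.openhab.voice.whisperstt:mode=API
# hypothetical local faster-whisper-server deployment
org.openhab.voice.whisperstt:apiUrl=http://127.0.0.1:8000/v1/audio/transcriptions
org.openhab.voice.whisperstt:apiKey=
org.openhab.voice.whisperstt:apiModelName=whisper-1
```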
+
 ## Configuration
 
 Use your favorite configuration UI to edit the Whisper settings:
@@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings:
 
 General options.
 
+- **Mode: LOCAL or API** - Choose either local computation or remote API use.
 - **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
 - **Preload Model** - Keep whisper model loaded.
 - **Single Utterance Mode** - When enabled, recognition stops listening after a single utterance.
@@ -139,6 +152,13 @@ Configure whisper options.
 
 - **Initial Prompt** - Initial prompt for whisper.
 - **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
 - **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
+- **Language** - If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.
+
+### API Configuration
+
+- **API key** - Optional API key for online services that require one.
+- **API url** - You may use your own service and define its URL here. Defaults to the OpenAI transcription API.
+- **API model name** - Your hosted service may have other models. Defaults to OpenAI's only available model, 'whisper-1'.
 
 ### Grammar Configuration
@@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file
 Its contents should look similar to:
 
 ```ini
+org.openhab.voice.whisperstt:mode=LOCAL
 org.openhab.voice.whisperstt:modelName=tiny
+org.openhab.voice.whisperstt:language=en
 org.openhab.voice.whisperstt:initSilenceSeconds=0.3
 org.openhab.voice.whisperstt:removeSilence=true
 org.openhab.voice.whisperstt:stepSeconds=0.3
@@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false
 org.openhab.voice.whisperstt:useGrammar=false
 org.openhab.voice.whisperstt:grammarPenalty=80.0
 org.openhab.voice.whisperstt:grammarLines=
+org.openhab.voice.whisperstt:apiKey=mykeyaaaa
+org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions
+org.openhab.voice.whisperstt:apiModelName=whisper-1
 ```
 
 ### Default Speech-to-Text Configuration
@@ -146,4 +146,29 @@ public class WhisperSTTConfiguration {
      * Print whisper.cpp library logs as binding debug logs.
      */
     public boolean enableWhisperLog;
+    /**
+     * LOCAL to use the embedded whisper library, or API to use an external OpenAI-compatible API.
+     */
+    public Mode mode = Mode.LOCAL;
+    /**
+     * If mode is set to API, use this URL.
+     */
+    public String apiUrl = "https://api.openai.com/v1/audio/transcriptions";
+    /**
+     * If mode is set to API, use this API key to access apiUrl.
+     */
+    public String apiKey = "";
+    /**
+     * If specified, speed up recognition by avoiding auto-detection.
+     */
+    public String language = "";
+    /**
+     * Model name (API only).
+     */
+    public String apiModelName = "whisper-1";
+
+    public enum Mode {
+        LOCAL,
+        API
+    }
 }
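As a usage sketch (not part of this change set), the class is materialized from the raw service configuration map via openHAB's `Configuration.as(...)`, the same call `WhisperSTTService` uses below; the property map here is hypothetical:

```java
import java.util.Map;

import org.openhab.core.config.core.Configuration;

class WhisperSTTConfigurationSketch {
    public static void main(String[] args) {
        // Hypothetical properties, as they would arrive from whisperstt.cfg
        Map<String, Object> properties = Map.of("mode", "API", "apiModelName", "whisper-1");
        // Reflectively maps the entries onto the public fields of WhisperSTTConfiguration
        WhisperSTTConfiguration config = new Configuration(properties).as(WhisperSTTConfiguration.class);
        System.out.println(config.mode + " -> " + config.apiUrl);
    }
}
```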
@@ -12,12 +12,10 @@
  */
 package org.openhab.voice.whisperstt.internal;
 
-import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
-import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
-import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
-import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
+import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*;
 
 import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.nio.ByteBuffer;
@@ -32,7 +30,9 @@ import java.util.Date;
 import java.util.Locale;
 import java.util.Map;
 import java.util.Set;
+import java.util.concurrent.ExecutionException;
 import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeoutException;
 import java.util.concurrent.atomic.AtomicBoolean;
 
 import javax.sound.sampled.AudioFileFormat;
@@ -41,6 +41,13 @@ import javax.sound.sampled.AudioSystem;
 
 import org.eclipse.jdt.annotation.NonNullByDefault;
 import org.eclipse.jdt.annotation.Nullable;
+import org.eclipse.jetty.client.HttpClient;
+import org.eclipse.jetty.client.api.ContentResponse;
+import org.eclipse.jetty.client.api.Request;
+import org.eclipse.jetty.client.util.InputStreamContentProvider;
+import org.eclipse.jetty.client.util.MultiPartContentProvider;
+import org.eclipse.jetty.client.util.StringContentProvider;
+import org.eclipse.jetty.http.HttpMethod;
 import org.openhab.core.OpenHAB;
 import org.openhab.core.audio.AudioFormat;
 import org.openhab.core.audio.AudioStream;
@@ -48,6 +55,7 @@ import org.openhab.core.audio.utils.AudioWaveUtils;
 import org.openhab.core.common.ThreadPoolManager;
 import org.openhab.core.config.core.ConfigurableService;
 import org.openhab.core.config.core.Configuration;
+import org.openhab.core.io.net.http.HttpClientFactory;
 import org.openhab.core.io.rest.LocaleService;
 import org.openhab.core.voice.RecognitionStartEvent;
 import org.openhab.core.voice.RecognitionStopEvent;
@@ -57,6 +65,7 @@ import org.openhab.core.voice.STTService;
 import org.openhab.core.voice.STTServiceHandle;
 import org.openhab.core.voice.SpeechRecognitionErrorEvent;
 import org.openhab.core.voice.SpeechRecognitionEvent;
+import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode;
 import org.openhab.voice.whisperstt.internal.utils.VAD;
 import org.osgi.framework.Constants;
 import org.osgi.service.component.annotations.Activate;
@@ -96,10 +105,13 @@ public class WhisperSTTService implements STTService {
     private @Nullable WhisperContext context;
     private @Nullable WhisperGrammar grammar;
     private @Nullable WhisperJNI whisper;
+    private boolean isWhisperLibAlreadyLoaded = false;
+    private final HttpClientFactory httpClientFactory;
 
     @Activate
-    public WhisperSTTService(@Reference LocaleService localeService) {
+    public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) {
         this.localeService = localeService;
+        this.httpClientFactory = httpClientFactory;
     }
 
     @Activate
@@ -108,7 +120,8 @@ public class WhisperSTTService implements STTService {
             if (!Files.exists(WHISPER_FOLDER)) {
                 Files.createDirectory(WHISPER_FOLDER);
             }
-            WhisperJNI.loadLibrary(getLoadOptions());
+            this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
+            loadWhisperLibraryIfNeeded();
             VoiceActivityDetector.loadLibrary();
             whisper = new WhisperJNI();
         } catch (IOException | RuntimeException e) {
@@ -117,6 +130,13 @@ public class WhisperSTTService implements STTService {
         configChange(config);
     }
 
+    private void loadWhisperLibraryIfNeeded() throws IOException {
+        if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) {
+            WhisperJNI.loadLibrary(getLoadOptions());
+            isWhisperLibAlreadyLoaded = true;
+        }
+    }
+
     private WhisperJNI.LoadOptions getLoadOptions() {
         Path libFolder = Paths.get("/usr/local/lib");
         Path libFolderWin = Paths.get("/Windows/System32");
@@ -167,14 +187,27 @@ public class WhisperSTTService implements STTService {
 
     private void configChange(Map<String, Object> config) {
         this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
-        WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
         WhisperGrammar grammar = this.grammar;
         if (grammar != null) {
             grammar.close();
             this.grammar = null;
         }
+
+        // API mode
+        if (this.config.mode == Mode.API) {
+            try {
+                unloadContext();
+            } catch (IOException e) {
+                logger.warn("IOException unloading model: {}", e.getMessage());
+            }
+            return;
+        }
+
+        // Local mode
         WhisperJNI whisper;
         try {
+            loadWhisperLibraryIfNeeded();
+            WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
             whisper = getWhisper();
         } catch (IOException ignored) {
             logger.warn("library not loaded, the add-on will not work");
@@ -228,9 +261,17 @@ public class WhisperSTTService implements STTService {
 
     @Override
     public Set<Locale> getSupportedLocales() {
-        // as it is not possible to determine the language of the model that was downloaded and setup by the user, it is
-        // assumed the language of the model is matching the locale of the openHAB server
-        return Set.of(localeService.getLocale(null));
+        // Attempt to create a locale from the configured language
+        String language = config.language;
+        Locale modelLocale = localeService.getLocale(null);
+        if (!language.isBlank()) {
+            try {
+                modelLocale = Locale.forLanguageTag(language);
+            } catch (IllegalArgumentException e) {
+                logger.warn("Invalid language '{}', defaulting to server locale", language);
+            }
+        }
+        return Set.of(modelLocale);
     }
 
     @Override
@@ -245,34 +286,20 @@ public class WhisperSTTService implements STTService {
     @Override
     public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
             throws STTException {
 
         AtomicBoolean aborted = new AtomicBoolean(false);
-        WhisperContext ctx = null;
-        WhisperState state = null;
         try {
-            var whisper = getWhisper();
-            ctx = getContext();
-            logger.debug("Creating whisper state...");
-            state = whisper.initState(ctx);
-            logger.debug("Whisper state created");
             logger.debug("Creating VAD instance...");
-            final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE);
+            final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE);
             VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
                     config.vadStep, config.vadSensitivity);
             logger.debug("VAD instance created");
             sttListener.sttEventReceived(new RecognitionStartEvent());
-            backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted);
+            backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted);
         } catch (IOException e) {
-            if (ctx != null && !config.preloadModel) {
-                ctx.close();
-            }
-            if (state != null) {
-                state.close();
-            }
             throw new STTException("Exception during initialization", e);
         }
-        return () -> {
-            aborted.set(true);
-        };
+        return () -> aborted.set(true);
     }
 
     private WhisperJNI getWhisper() throws IOException {
@@ -339,9 +366,8 @@ public class WhisperSTTService implements STTService {
         }
     }
 
-    private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep,
-            Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
-        var releaseContext = !config.preloadModel;
+    private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener,
+            AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
         final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
         final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
         final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
@@ -353,21 +379,17 @@ public class WhisperSTTService implements STTService {
         logger.debug("Max silence samples {}", nMaxSilenceSamples);
         // used to store the step samples in libfvad wanted format 16-bit int
         final short[] stepAudioSamples = new short[nSamplesStep];
-        // used to store the full samples in whisper wanted format 32-bit float
-        final float[] audioSamples = new float[nSamplesMax];
+        // used to store the full retained samples for whisper
+        final short[] audioSamples = new short[nSamplesMax];
         executor.submit(() -> {
             int audioSamplesOffset = 0;
             int silenceSamplesCounter = 0;
             int nProcessedSamples = 0;
-            int numBytesRead;
             boolean voiceDetected = false;
             String transcription = "";
-            String tempTranscription = "";
-            VAD.@Nullable VADResult lastVADResult;
             VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
             try {
-                try (state; //
-                        audioStream; //
+                try (audioStream; //
                         vad) {
                     if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
                         AudioWaveUtils.removeFMT(audioStream);
@@ -376,10 +398,9 @@ public class WhisperSTTService implements STTService {
                             .order(ByteOrder.LITTLE_ENDIAN);
                     // init remaining to full capacity
                     int remaining = captureBuffer.capacity();
-                    WhisperFullParams params = getWhisperFullParams(ctx, locale);
                     while (!aborted.get()) {
                         // read until no remaining so we get the complete step samples
-                        numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
+                        int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
                                 remaining);
                         if (aborted.get() || numBytesRead == -1) {
                             break;
@@ -395,17 +416,15 @@ public class WhisperSTTService implements STTService {
                         while (shortBuffer.hasRemaining()) {
                             var position = shortBuffer.position();
                             short i16BitSample = shortBuffer.get();
-                            float f32BitSample = Float.min(1f,
-                                    Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
                             stepAudioSamples[position] = i16BitSample;
-                            audioSamples[audioSamplesOffset++] = f32BitSample;
+                            audioSamples[audioSamplesOffset++] = i16BitSample;
                             nProcessedSamples++;
                         }
                         // run vad
                         if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
                             logger.debug("VAD: Skipping, max length reached");
                         } else {
-                            lastVADResult = vad.analyze(stepAudioSamples);
+                            VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples);
                             if (lastVADResult.isVoice()) {
                                 voiceDetected = true;
                                 logger.debug("VAD: voice detected");
@@ -484,43 +503,26 @@ public class WhisperSTTService implements STTService {
                                 }
                             }
                         }
-                        // run whisper
-                        logger.debug("running whisper with {} seconds of audio...",
-                                Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
-                        long execStartTime = System.currentTimeMillis();
-                        var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset);
-                        logger.debug("whisper ended in {}ms with result code {}",
-                                System.currentTimeMillis() - execStartTime, result);
-                        // process result
-                        if (result != 0) {
-                            emitSpeechRecognitionError(sttListener);
-                            break;
-                        }
-                        int nSegments = whisper.fullNSegmentsFromState(state);
-                        logger.debug("Available transcription segments {}", nSegments);
-                        if (nSegments == 1) {
-                            tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
+                        // run whisper, either locally or by remote API
+                        String tempTranscription = (switch (config.mode) {
+                            case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage());
+                            case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage());
+                        });
+
+                        if (tempTranscription != null && !tempTranscription.isBlank()) {
                             if (config.createWAVRecord) {
                                 createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
                                         locale.getLanguage());
                             }
+                            transcription += tempTranscription;
                             if (config.singleUtteranceMode) {
                                 logger.debug("single utterance mode, ending transcription");
-                                transcription = tempTranscription;
                                 break;
-                            } else {
-                                // start a new transcription segment
-                                transcription += tempTranscription;
-                                tempTranscription = "";
                             }
-                        } else if (nSegments == 0 && config.singleUtteranceMode) {
-                            logger.debug("Single utterance mode and no results, ending transcription");
-                            break;
-                        } else if (nSegments > 1) {
-                            // non reachable
-                            logger.warn("Whisper should be configured in single segment mode {}", nSegments);
+                        } else {
+                            break;
                         }
 
                         // reset state to start with next segment
                         voiceDetected = false;
                         silenceSamplesCounter = 0;
@@ -528,10 +530,6 @@ public class WhisperSTTService implements STTService {
-                        logger.debug("Partial transcription: {}", tempTranscription);
                         logger.debug("Transcription: {}", transcription);
                     }
-                } finally {
-                    if (releaseContext) {
-                        ctx.close();
-                    }
                 }
                 // emit result
                 if (!aborted.get()) {
@@ -543,7 +541,7 @@ public class WhisperSTTService implements STTService {
                     emitSpeechRecognitionNoResultsError(sttListener);
                 }
             }
-            } catch (IOException e) {
+            } catch (STTException | IOException e) {
                 logger.warn("Error running speech to text: {}", e.getMessage());
                 emitSpeechRecognitionError(sttListener);
             } catch (UnsatisfiedLinkError e) {
@@ -553,7 +551,120 @@ public class WhisperSTTService implements STTService {
         });
     }
 
-    private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException {
+    @Nullable
+    private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException {
+        logger.debug("running whisper with {} seconds of audio...",
+                Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
+        var releaseContext = !config.preloadModel;
+
+        WhisperJNI whisper = null;
+        WhisperContext ctx = null;
+        WhisperState state = null;
+        try {
+            whisper = getWhisper();
+            ctx = getContext();
+            logger.debug("Creating whisper state...");
+            state = whisper.initState(ctx);
+            logger.debug("Whisper state created");
+            WhisperFullParams params = getWhisperFullParams(ctx, language);
+
+            // convert to local whisper format (float)
+            float[] floatArray = new float[audioSamples.length];
+            for (int i = 0; i < audioSamples.length; i++) {
+                floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f));
+            }
+
+            long execStartTime = System.currentTimeMillis();
+            var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset);
+            logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime,
+                    result);
+            // process result
+            if (result != 0) {
+                throw new STTException("Cannot use whisper locally, result code: " + result);
+            }
+            int nSegments = whisper.fullNSegmentsFromState(state);
+            logger.debug("Available transcription segments {}", nSegments);
+            if (nSegments == 1) {
+                return whisper.fullGetSegmentTextFromState(state, 0);
+            } else if (nSegments == 0 && config.singleUtteranceMode) {
+                logger.debug("Single utterance mode and no results, ending transcription");
+                return null;
+            } else {
+                // non reachable
+                logger.warn("Whisper should be configured in single segment mode {}", nSegments);
+                return null;
+            }
+        } catch (IOException e) {
+            if (state != null) {
+                state.close();
+            }
+            throw new STTException("Cannot use whisper locally", e);
+        } finally {
+            if (releaseContext && ctx != null) {
+                ctx.close();
+            }
+        }
+    }
+
+    private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
+
+        // convert to byte array, each short has 2 bytes
+        int size = audioSamplesOffset * 2;
+        ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
+        for (int i = 0; i < audioSamplesOffset; i++) {
+            byteArrayBuffer.putShort(audioStream[i]);
+        }
+        javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat(
+                javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE,
+                false);
+        byte[] byteArray = byteArrayBuffer.array();
+
+        try {
+            AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat,
+                    size);
+
+            // write the stream as a WAV file, into a byte array stream:
+            ByteArrayInputStream byteArrayInputStream = null;
+            try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
+                AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos);
+                byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray());
+            }
+
+            // prepare HTTP request
+            HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient();
+            MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider();
+            multiPartContentProvider.addFilePart("file", "audio.wav",
+                    new InputStreamContentProvider(byteArrayInputStream), null);
+            multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null);
+            multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null);
+            multiPartContentProvider.addFieldPart("temperature",
+                    new StringContentProvider(Float.toString(this.config.temperature)), null);
+            if (!language.isBlank()) {
+                multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null);
+            }
+            Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST)
                    .content(multiPartContentProvider);
+            if (!config.apiKey.isBlank()) {
+                request = request.header("Authorization", "Bearer " + config.apiKey);
+            }
+            // execute the request
+            ContentResponse response = request.send();
+
+            // check the HTTP status code from the response
+            int statusCode = response.getStatus();
+            if (statusCode < 200 || statusCode >= 300) {
+                logger.debug("HTTP error: Received status code {}, full error is {}", statusCode,
+                        response.getContentAsString());
+                throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode);
+            }
+            return response.getContentAsString();
+
+        } catch (InterruptedException | TimeoutException | ExecutionException | IOException e) {
+            throw new STTException("Exception during attempt to get speech recognition result from API", e);
+        }
+    }
+
+    private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException {
         WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
         var params = new WhisperFullParams(strategy);
         params.temperature = config.temperature;
@@ -570,7 +681,7 @@ public class WhisperSTTService implements STTService {
             params.grammarPenalty = config.grammarPenalty;
         }
         // there are no single-language models other than the English ones
-        params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
+        params.language = getWhisper().isMultilingual(context) ? language : "en";
         // implementation assumes these options
         params.translate = false;
         params.detectLanguage = false;
@@ -605,7 +716,7 @@ public class WhisperSTTService implements STTService {
         }
     }
 
-    private void createAudioFile(float[] samples, int size, String transcription, String language) {
+    private void createAudioFile(short[] samples, int size, String transcription, String language) {
         createSamplesDir();
         javax.sound.sampled.AudioFormat jAudioFormat;
         ByteBuffer byteBuffer;
@@ -615,7 +726,7 @@ public class WhisperSTTService implements STTService {
                     WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
             byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
             for (int i = 0; i < size; i++) {
-                byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
+                byteBuffer.putShort(samples[i]);
             }
         } else {
             logger.debug("Saving audio file with sample format f32");
@@ -623,7 +734,7 @@ public class WhisperSTTService implements STTService {
                     WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
             byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
             for (int i = 0; i < size; i++) {
-                byteBuffer.putFloat(samples[i]);
+                byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f)));
             }
         }
         AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
@@ -11,7 +11,7 @@
     </parameter-group>
     <parameter-group name="vad">
         <label>Voice Activity Detection</label>
-        <description>Configure the VAD mechanisim used to isolate single phrases to feed whisper with.</description>
+        <description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
     </parameter-group>
     <parameter-group name="whisper">
         <label>Whisper Options</label>
@@ -19,7 +19,7 @@
     </parameter-group>
     <parameter-group name="grammar">
         <label>Grammar</label>
-        <description>Define a grammar to improve transcrptions.</description>
+        <description>Define a grammar to improve transcriptions.</description>
     </parameter-group>
     <parameter-group name="messages">
         <label>Info Messages</label>
@@ -30,9 +30,27 @@
     <description>Options added for developers.</description>
     <advanced>true</advanced>
 </parameter-group>
+<parameter-group name="openaiapi">
+    <label>API Configuration Options</label>
+    <description>Configure an OpenAI-compatible API, if you don't want to use the local model.</description>
+</parameter-group>
+<parameter name="mode" type="text" groupName="stt">
+    <label>Local Mode Or API</label>
+    <description>Use the local model or the OpenAI-compatible API.</description>
+    <default>LOCAL</default>
+    <options>
+        <option value="LOCAL">Local</option>
+        <option value="API">OpenAI API</option>
+    </options>
+</parameter>
 <parameter name="modelName" type="text" groupName="stt" required="true">
-    <label>Model Name</label>
-    <description>Model name without extension.</description>
+    <label>Local Model Name</label>
+    <description>Model name without extension. Local mode only.</description>
 </parameter>
+<parameter name="language" type="text" groupName="whisper">
+    <label>Language</label>
+    <description>If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.</description>
+    <default></default>
+</parameter>
 <parameter name="preloadModel" type="boolean" groupName="stt">
     <label>Preload Model</label>
@@ -225,5 +243,20 @@
     <default>false</default>
     <advanced>true</advanced>
 </parameter>
+<parameter name="apiKey" type="text" groupName="openaiapi">
+    <label>API Key</label>
+    <description>Key to access the API.</description>
+    <default></default>
+</parameter>
+<parameter name="apiUrl" type="text" groupName="openaiapi">
+    <label>API URL</label>
+    <description>OpenAI-compatible API URL. Defaults to the OpenAI transcription service.</description>
+    <default>https://api.openai.com/v1/audio/transcriptions</default>
+</parameter>
+<parameter name="apiModelName" type="text" groupName="openaiapi">
+    <label>API Model</label>
+    <description>Model name to use (API only). Defaults to the only model available from OpenAI (whisper-1).</description>
+    <default>whisper-1</default>
+</parameter>
 </config-description>
 </config-description:config-descriptions>