Speech

The Speech API gives you access to TTS and Whisper models for text-to-speech (TTS) and speech-to-text (STT). It supports the following:

Creating audio from text (TTS)
Transcribing audio to text (STT)
Translating non-English audio to English text
Converting non-English audio to English audio

Config

TTS and STT configs are provided in configs/prompts/speech/*.

For TTS you can define:

The model and fallback models used by chat, eg tts-1.
The output voice. One of alloy, echo, fable, onyx, nova, or shimmer.
The output speed, from 0.25 to 4.0.
The response format. One of mp3, opus, aac, flac, wav, or pcm.
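
The constraints on the TTS options can be checked before a request is sent. The sketch below is illustrative only: the class name TTSConfig, its field names, and its default values are hypothetical and do not reflect the actual schema of the files in configs/prompts/speech/. Only the allowed voices, speed range, and formats are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values, copied from the TTS option list in this document.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
RESPONSE_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


@dataclass
class TTSConfig:
    """Hypothetical in-memory form of a TTS config (field names are assumptions)."""

    model: str = "tts-1"
    fallback_models: List[str] = field(default_factory=list)
    voice: str = "alloy"
    speed: float = 1.0
    response_format: str = "mp3"

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges as early as possible.
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice!r}")
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            raise ValueError(f"speed must be in [{MIN_SPEED}, {MAX_SPEED}]")
        if self.response_format not in RESPONSE_FORMATS:
            raise ValueError(f"unknown response format: {self.response_format!r}")
```

For example, TTSConfig(voice="nova", speed=1.25) is accepted, while TTSConfig(speed=5.0) raises a ValueError.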

For STT you can define:

The model. Currently only whisper-1 is available.
A default prompt describing how to transcribe the provided audio.
The timestamp granularity. Either segment or word.
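
As a rough illustration of how these STT options might be turned into request parameters, the sketch below assumes the backend is the OpenAI audio transcription endpoint, which accepts timestamp granularities only together with the verbose_json response format. The names STTConfig and build_transcription_params are hypothetical and not part of this project.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Allowed granularities, copied from the STT option list in this document.
GRANULARITIES = {"segment", "word"}


@dataclass
class STTConfig:
    """Hypothetical in-memory form of an STT config (field names are assumptions)."""

    model: str = "whisper-1"
    prompt: Optional[str] = None
    timestamp_granularity: str = "segment"


def build_transcription_params(cfg: STTConfig) -> Dict[str, Any]:
    """Map the config onto keyword arguments for a transcription request."""
    if cfg.timestamp_granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {cfg.timestamp_granularity!r}")
    params: Dict[str, Any] = {
        "model": cfg.model,
        # Assumption: timestamps are only returned with verbose_json output.
        "response_format": "verbose_json",
        "timestamp_granularities": [cfg.timestamp_granularity],
    }
    if cfg.prompt:
        # Only send a prompt when the config defines one.
        params["prompt"] = cfg.prompt
    return params
```

The returned dictionary contains no audio data; the caller would add the audio file itself when issuing the request.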

Audio responses

For endpoints that return audio, the content-type of the response corresponds to the response format set in the config (eg audio/mpeg for mp3).
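
A client can use this correspondence to pick a file extension or to verify a response. In the sketch below, only the mp3 to audio/mpeg pairing comes from this document; the other content types are assumed from common usage and may differ from what the API actually returns. The function name content_type_for is hypothetical.

```python
# Response format -> content-type. Only the mp3 entry is confirmed by this
# document; the remaining values are assumptions based on common MIME usage.
CONTENT_TYPES = {
    "mp3": "audio/mpeg",
    "opus": "audio/opus",
    "aac": "audio/aac",
    "flac": "audio/flac",
    "wav": "audio/wav",
    "pcm": "audio/pcm",
}


def content_type_for(response_format: str) -> str:
    """Return the content-type expected for a configured response format."""
    try:
        return CONTENT_TYPES[response_format]
    except KeyError:
        raise ValueError(f"unknown response format: {response_format!r}") from None
```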