The Speech API gives you access to TTS and Whisper models for text-to-speech (TTS) and speech-to-text (STT). It supports the following:

  • Creating audio from text (TTS)
  • Transcribing audio to text (STT)
  • Translating non-English audio to English text
  • Converting non-English audio to English audio


TTS and STT configs are provided in configs/prompts/speech/*.

For TTS you can define:

  • The model and fallback models used for speech generation, eg tts-1.
  • The output voice. One of alloy, echo, fable, onyx, nova, or shimmer.
  • The output speed, from 0.25 to 4.0.
  • The response format. One of mp3, opus, aac, flac, wav, or pcm.
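
The TTS options above can be checked before a config is used. The following is a minimal sketch, not the project's actual config schema: the `TTSConfig` class, its field names, its default values, and the `validate_tts` helper are all hypothetical. Only the allowed voices, formats, and speed range are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values as listed in this document.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


@dataclass
class TTSConfig:
    """Hypothetical in-memory shape of a TTS config (field names and defaults are assumed)."""
    model: str = "tts-1"
    fallback_models: List[str] = field(default_factory=list)
    voice: str = "alloy"
    speed: float = 1.0
    response_format: str = "mp3"


def validate_tts(cfg: TTSConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config is valid."""
    errors = []
    if cfg.voice not in VOICES:
        errors.append(f"unknown voice: {cfg.voice!r}")
    if not MIN_SPEED <= cfg.speed <= MAX_SPEED:
        errors.append(f"speed {cfg.speed} is outside {MIN_SPEED}-{MAX_SPEED}")
    if cfg.response_format not in FORMATS:
        errors.append(f"unknown response format: {cfg.response_format!r}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid field at once.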

For STT you can define:

  • The model. Currently only whisper-1 is available.
  • A default prompt describing how to transcribe the provided audio.
  • The timestamp granularity. Either segment or word.
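
The STT options can be sketched the same way. Again, `STTConfig`, its field names, its default values, and `validate_stt` are hypothetical names used only for illustration; the permitted model and granularities come from the list above.

```python
from dataclasses import dataclass
from typing import List

# Values as listed in this document.
STT_MODELS = {"whisper-1"}
GRANULARITIES = {"segment", "word"}


@dataclass
class STTConfig:
    """Hypothetical in-memory shape of an STT config (field names and defaults are assumed)."""
    model: str = "whisper-1"
    prompt: str = ""
    timestamp_granularity: str = "segment"


def validate_stt(cfg: STTConfig) -> List[str]:
    """Return a list of problems; an empty list means the config is valid."""
    errors = []
    if cfg.model not in STT_MODELS:
        errors.append(f"unsupported model: {cfg.model!r}")
    if cfg.timestamp_granularity not in GRANULARITIES:
        errors.append(f"unknown timestamp granularity: {cfg.timestamp_granularity!r}")
    return errors
```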

Audio responses

For endpoints that return audio, the content-type of the response corresponds to the response format set in the config (eg audio/mpeg for mp3).
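
A client can use this to predict the content-type it should expect for a given config. The sketch below is illustrative only: the document confirms only the audio/mpeg example, so every other entry in the mapping is an assumption based on commonly used media types, and `content_type_for` is a hypothetical helper rather than part of the API.

```python
# Only the mp3 -> audio/mpeg pairing is stated in this document.
# The remaining entries are ASSUMED from commonly used media types
# and may differ from what the API actually returns.
CONTENT_TYPES = {
    "mp3": "audio/mpeg",
    "opus": "audio/opus",  # assumption
    "aac": "audio/aac",    # assumption
    "flac": "audio/flac",  # assumption
    "wav": "audio/wav",    # assumption
    "pcm": "audio/pcm",    # assumption
}


def content_type_for(response_format: str) -> str:
    """Look up the expected content-type for a configured response format."""
    try:
        return CONTENT_TYPES[response_format]
    except KeyError:
        raise ValueError(f"unknown response format: {response_format!r}") from None
```

For example, `content_type_for("mp3")` returns `"audio/mpeg"`, matching the example given above.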