POST /api/v1/generate
Content-Type: multipart/form-data
Authentication
Send your API key in the Authorization header as a Bearer token.
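
As a minimal sketch of the header described above: the helper name `auth_headers` and the `API_KEY` environment variable are illustrative assumptions, not part of the API.

```python
import os


def auth_headers(api_key: str) -> dict:
    """Build the Authorization header that carries the key as a Bearer token."""
    if not api_key:
        raise ValueError("API key is empty")
    return {"Authorization": f"Bearer {api_key}"}


# Assumption: the key is stored in an environment variable named API_KEY.
headers = auth_headers(os.environ.get("API_KEY", "demo-key"))
```

Merge the returned dictionary into the headers of every request to the endpoint.
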
Audio Transcription (ASR)
Supported model: Whisper V3 Turbo (model=whisper_v3_turbo)
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Required | Must be whisper_v3_turbo. |
| file | file | Required | Audio file to transcribe (MP3, WAV, M4A, OGG, WebM). |
| language | string | Optional | Spoken language code (default en). Supports 99 languages. |
| response_format | string | Optional | Output format: json or text. |
Example
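
A transcription request could look roughly like the sketch below, which uses only the Python standard library. `BASE_URL` is a placeholder assumption (the host is not given here), `encode_multipart` and `transcribe` are hypothetical helper names, and the shape of the JSON response is not specified in this reference, so the parsed body is returned as-is.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.example.com"  # assumption: replace with the real host


def encode_multipart(fields, files):
    """Encode text fields and {name: (filename, bytes, content_type)} files
    as a multipart/form-data body. Returns (body, content_type_header)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    for name, (filename, data, ctype) in files.items():
        head = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: {ctype}\r\n\r\n'
        ).encode()
        parts.append(head + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(api_key, audio_bytes, filename="audio.mp3", language="en"):
    """POST an audio file to the endpoint and return the parsed JSON reply."""
    body, content_type = encode_multipart(
        {"model": "whisper_v3_turbo", "language": language, "response_format": "json"},
        {"file": (filename, audio_bytes, "audio/mpeg")},
    )
    request = urllib.request.Request(
        f"{BASE_URL}/api/v1/generate",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `transcribe(api_key, open("meeting.mp3", "rb").read())` would send the file; the file name here is only an example.
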
Text-to-Speech (TTS)
Three models are available. All accept multipart/form-data and return a WAV audio file by default. For real-time streaming over WebSocket, see Stream Audio. To upload a voice reference for cloning, see Upload Voice Reference.
Higgs Audio V2.5 (model=higgs2p5)
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Required | Must be higgs2p5. |
| text | string | Required | Text to convert to speech. |
| voice | string | Optional | Voice preset (e.g. Linda, Jack). |
| voice_reference_file | file | Optional | Audio file for voice cloning (WAV, MP3). |
| voice_id | string | Optional | Stored voice ID returned by Upload Voice Reference. |
| voice_url | string | Optional | External URL to a voice reference audio sample. |
| voice_name | string | Optional | Name of a saved voice from the voice library. |
| voice_settings | string | Optional | JSON string with voice settings. Supports speed (default 1.0). |
| sampling | string | Optional | JSON string with sampling controls: temperature (default 1.0), top_p (default 0.95), top_k (default 50). |
| stream | boolean | Optional | false = return WAV file (default); true = HTTP SSE streaming. |
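
Because voice_settings and sampling are passed as JSON strings rather than nested form fields, it may help to see how the form values could be assembled. The sketch below is illustrative: `higgs_fields` is a hypothetical helper, and serializing the boolean as the literal strings "true"/"false" is an assumption about how the form field is parsed.

```python
import json


def higgs_fields(text, voice=None, speed=1.0, temperature=1.0,
                 top_p=0.95, top_k=50, stream=False):
    """Assemble form fields for model=higgs2p5 using the documented defaults."""
    fields = {
        "model": "higgs2p5",
        "text": text,
        # Both of these are JSON-encoded strings, not nested form data.
        "voice_settings": json.dumps({"speed": speed}),
        "sampling": json.dumps(
            {"temperature": temperature, "top_p": top_p, "top_k": top_k}
        ),
        # Assumption: booleans are sent as lowercase strings.
        "stream": "true" if stream else "false",
    }
    if voice is not None:
        fields["voice"] = voice
    return fields


fields = higgs_fields("Hello there.", voice="Linda", speed=1.1)
```

The resulting dictionary can be sent as the text parts of a multipart/form-data request.
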
ChatterBox Voice Twin (model=chatterbox)
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Required | Must be chatterbox. |
| text | string | Required | Text to convert to speech (≤ 1,000 characters recommended). |
| language_id | string | Optional | Language code (e.g. en, zh, es, ja). Supports 23 languages. Default en. |
| audio_prompt_file | file | Optional | Voice reference clip for voice cloning (WAV/MP3/M4A/OGG, max 30s). |
| voice_id | string | Optional | Stored voice ID returned by Upload Voice Reference. |
| preset_url | string | Optional | URL to a voice preset audio sample. |
| exaggeration | number | Optional | Expressiveness: 0.0 = subtle, 0.5 = balanced, 1.0+ = highly animated (default 0.5). |
| temperature | number | Optional | Sampling temperature (default 0.8). |
| diffusion_steps | number | Optional | Quality vs. latency. Higher = better quality, slower (default 5). |
| max_tokens | integer | Optional | Upper bound on generated tokens (default 3000). |
| top_p | number | Optional | Nucleus sampling ceiling (default 1.0). |
| min_p | number | Optional | Nucleus sampling floor (default 0.05). |
| repetition_penalty | number | Optional | Penalizes repeated tokens (default 1.2). |
| seed | integer | Optional | Seed for reproducible generation (null = random). |
| stream | boolean | Optional | false = return WAV file (default); true = HTTP SSE streaming. |
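
One way to keep the documented defaults in a single place is a small parameter object, sketched below. `ChatterboxParams` and its methods are hypothetical names, and omitting seed from the form when it is unset is an assumption about how "null = random" is expressed over multipart/form-data.

```python
from dataclasses import asdict, dataclass
from typing import Optional

RECOMMENDED_MAX_CHARS = 1000  # from the text parameter's recommendation


@dataclass
class ChatterboxParams:
    text: str
    language_id: str = "en"
    exaggeration: float = 0.5
    temperature: float = 0.8
    diffusion_steps: int = 5
    max_tokens: int = 3000
    top_p: float = 1.0
    min_p: float = 0.05
    repetition_penalty: float = 1.2
    seed: Optional[int] = None  # None means random generation

    def exceeds_recommended_length(self) -> bool:
        """True when text is longer than the recommended 1,000 characters."""
        return len(self.text) > RECOMMENDED_MAX_CHARS

    def to_fields(self) -> dict:
        """Return form fields as strings, skipping unset (None) values."""
        fields = {"model": "chatterbox"}
        for name, value in asdict(self).items():
            if value is not None:
                fields[name] = str(value)
        return fields


fields = ChatterboxParams(text="Welcome back!", exaggeration=0.7, seed=42).to_fields()
```

Passing the same seed along with identical parameters is what the seed field is documented to make reproducible; leaving it unset keeps generation random.
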
Qwen3 TTS (model=qwen3-tts)
Supports named speakers (CustomVoice mode) or voice cloning (Base mode). voice and voice_id/voice_url cannot be used together.
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Required | Must be qwen3-tts. |
| text | string | Required | Text to synthesize. |
| voice | string | Optional | Named speaker for CustomVoice mode: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee. Cannot be used with voice_id or voice_url. |
| voice_id | string | Optional | Stored voice ID for Base model (from Upload Voice Reference). |
| voice_url | string | Optional | External URL to voice reference audio (Base model). |
| voice_settings | string | Optional | JSON string with voice settings. Supports speed (default 1.0). |
| language | string | Optional | Auto, Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish (default Auto). |
| instructions | string | Optional | Style/emotion control (e.g. "speak cheerfully"). |
| response_format | string | Optional | Output format: wav (default), pcm, mp3, flac, aac, opus. |
| stream | boolean | Optional | false = return audio file (default); true = HTTP SSE streaming. |
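
The rule that voice cannot be combined with voice_id or voice_url can be checked on the client before a request is sent. The sketch below is one possible way to do that; `qwen3_fields` is a hypothetical helper, and rejecting unknown speaker names locally is a convenience assumption rather than documented server behavior.

```python
NAMED_SPEAKERS = {
    "Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric",
    "Ryan", "Aiden", "Ono_Anna", "Sohee",
}


def qwen3_fields(text, voice=None, voice_id=None, voice_url=None,
                 language="Auto", instructions=None, response_format="wav"):
    """Assemble form fields for model=qwen3-tts, enforcing the voice rules."""
    if voice is not None and (voice_id is not None or voice_url is not None):
        raise ValueError("voice cannot be used with voice_id or voice_url")
    if voice is not None and voice not in NAMED_SPEAKERS:
        raise ValueError(f"unknown named speaker: {voice}")

    fields = {
        "model": "qwen3-tts",
        "text": text,
        "language": language,
        "response_format": response_format,
    }
    optional = {
        "voice": voice,
        "voice_id": voice_id,
        "voice_url": voice_url,
        "instructions": instructions,
    }
    # Only include the optional fields that were actually provided.
    fields.update({k: v for k, v in optional.items() if v is not None})
    return fields


custom_voice = qwen3_fields("Good morning.", voice="Vivian",
                            instructions="speak cheerfully")
```
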