WSS /api/v1/generate/ws
The WebSocket endpoint delivers generated audio as a stream of binary PCM chunks, enabling lower-latency playback compared to the HTTP endpoint.
Supported models: higgs2p5, chatterbox, qwen3-tts
Protocol
The WebSocket session follows a 3-message handshake:
The server then sends:
- Binary frames: raw PCM audio chunks (16-bit, 24 kHz, mono)
- {"type": "complete"}: a JSON frame signaling end of stream
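The server-side half of the exchange can be sketched as a small receive loop. This is a minimal illustration, not the official client: `collect_pcm` is a hypothetical helper, and it assumes the WebSocket library hands binary frames over as `bytes` and JSON frames as `str`. Connecting and sending the request are left out because they depend on the handshake.

```python
import json
from typing import Iterable, Union


def collect_pcm(frames: Iterable[Union[bytes, str]]) -> bytes:
    """Concatenate binary PCM frames until the {"type": "complete"} frame arrives.

    `frames` is any iterable of received messages (assumption: binary frames
    arrive as bytes, JSON frames as str).
    """
    pcm = bytearray()
    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            # Binary frame: a chunk of raw 16-bit, 24 kHz, mono PCM.
            pcm.extend(frame)
            continue
        message = json.loads(frame)
        if isinstance(message, dict) and message.get("type") == "complete":
            # End-of-stream marker: return everything received so far.
            return bytes(pcm)
        # Any other JSON frame is ignored in this sketch.
    raise ConnectionError("stream ended before the completion frame")
```

For low-latency playback you would hand each chunk to the audio device as it arrives instead of buffering; the buffering version is shown only because it is the simplest to reason about.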
Parameters
Parameters in the TTS request JSON match those of the HTTP endpoint for each model. See Generate Audio for the full parameter list per model.

| Model | Key parameters |
|---|---|
| higgs2p5 | text, voice, voice_id, voice_url, voice_settings, sampling |
| chatterbox | text, language_id, voice_id, audio_prompt_file, exaggeration, temperature |
| qwen3-tts | text, voice, voice_id, voice_url, language, instructions, voice_settings |
Examples
The binary frames contain raw PCM audio: 16-bit signed integers, 24 kHz sample rate, mono channel. The stream carries no header, so the decoder must be given these parameters explicitly. Use a library like soundfile (Python) or AudioContext (browser) to decode and play.
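As a dependency-free alternative to soundfile, the collected PCM bytes can be wrapped in a WAV container with Python's standard `wave` module. This is a sketch under stated assumptions: the helper names are hypothetical, and the samples are assumed to be little-endian, which WAV requires but the format description above does not state.

```python
import io
import wave

SAMPLE_RATE = 24_000   # 24 kHz, per the format description
SAMPLE_WIDTH = 2       # 16-bit samples = 2 bytes each
CHANNELS = 1           # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container (assumes little-endian samples)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count of a raw PCM buffer."""
    return len(pcm) / (SAMPLE_WIDTH * CHANNELS * SAMPLE_RATE)
```

Writing the result of `pcm_to_wav_bytes` to a `.wav` file gives something any audio player can open. `pcm_duration_seconds` is handy for sanity-checking a stream: one second of audio in this format is 48,000 bytes.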