Voice & Speech

Talknex runs each call through a Deepgram STT → GPT-4o → ElevenLabs TTS pipeline, targeting sub-600ms responses with a natural, human-quality voice.

Voice providers

Two TTS providers are available, each with different trade-offs:

ElevenLabs (Recommended)
  • 500+ voices across 30 languages
  • Voice cloning on Fully Managed
  • Turbo v2 model: first word in ~280ms
  • Emotion and style controls
OpenAI TTS
  • 6 voices (alloy, echo, fable, onyx, nova, shimmer)
  • Slightly lower quality but more consistent
  • Useful when you use OpenAI exclusively
  • No cloning support

Voice parameters

voiceId (string)

ElevenLabs voice ID or OpenAI voice name. Find IDs in the Voice Library within the dashboard.

voiceSpeed (number, 0.5–2.0)

Playback speed multiplier. Default 1.0. Increase to 1.1–1.2 for more energetic sales personas.

voicePitch (number, -1.0–1.0)

Pitch shift relative to baseline. 0.0 is neutral. Negative values produce a deeper voice.

voiceStability (number, 0.0–1.0)

ElevenLabs stability. Higher = more consistent but less expressive. Default 0.5.

voiceSimilarityBoost (number, 0.0–1.0)

ElevenLabs similarity boost. Higher = sounds more like the original voice sample.
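
As a rough illustration, the parameters above could be collected into one settings object and range-checked before they are saved. The sketch below is in Python. The VOICE_RANGES table and the validate_voice_settings helper are hypothetical names for this example, not part of the Talknex API, and the voice ID shown is a placeholder. Only the parameter names and ranges come from the list above.

```python
# Hypothetical client-side check; Talknex's real request format may differ.
# Ranges are copied from the parameter list above.
VOICE_RANGES = {
    "voiceSpeed": (0.5, 2.0),
    "voicePitch": (-1.0, 1.0),
    "voiceStability": (0.0, 1.0),
    "voiceSimilarityBoost": (0.0, 1.0),
}


def validate_voice_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    voice_id = settings.get("voiceId")
    if not isinstance(voice_id, str) or not voice_id:
        problems.append("voiceId must be a non-empty string")
    for name, (low, high) in VOICE_RANGES.items():
        if name not in settings:
            continue  # numeric parameters are treated as optional in this sketch
        value = settings[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number")
        elif not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}")
    return problems


settings = {
    "voiceId": "example-voice-id",  # placeholder, not a real voice ID
    "voiceSpeed": 1.1,
    "voicePitch": 0.0,
    "voiceStability": 0.5,
}
print(validate_voice_settings(settings))  # []
```

Returning a list of problems instead of raising on the first one lets a form or CLI report every out-of-range value at once.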

Interruption handling

Deepgram's voice activity detection (VAD) continuously monitors the caller's audio track. When it detects speech while TTS is playing, the agent pauses and listens. This is called an interruption.

The interruptionSensitivity parameter controls how aggressively this triggers:

0.3 (Low)

Agent continues speaking unless the caller speaks loudly and clearly. Good for IVR-style flows where you don't want background noise to interrupt.

0.5 (Default)

Balanced. Normal speech interrupts the agent. Most use cases work best here.

0.8 (High)

Agent yields at the first sign of caller speech. Great for conversational, empathetic personas like support agents.
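
One way to keep these values consistent across agents is a small preset table. The Python sketch below uses the three documented values. The preset names and the pick_interruption_sensitivity helper are hypothetical, and how the resulting value is submitted to Talknex is left out.

```python
# Documented sensitivity levels; the preset names are illustrative only.
INTERRUPTION_PRESETS = {
    "ivr": 0.3,      # low: keep speaking through background noise
    "default": 0.5,  # balanced: normal speech interrupts the agent
    "support": 0.8,  # high: yield at the first sign of caller speech
}


def pick_interruption_sensitivity(persona: str) -> float:
    """Return the preset for a persona, falling back to the 0.5 default."""
    return INTERRUPTION_PRESETS.get(persona, INTERRUPTION_PRESETS["default"])


agent_config = {"interruptionSensitivity": pick_interruption_sensitivity("ivr")}
print(agent_config)  # {'interruptionSensitivity': 0.3}
```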

Latency optimization

Target end-to-end latency is under 600ms. If you're seeing higher numbers, work through this checklist:

1. Check Deepgram region

Set the Deepgram endpoint to the region closest to your API server. Mismatched regions add 80–200ms.

2. Use ElevenLabs Turbo v2

Turbo v2 starts streaming audio ~40% faster than the standard model. Always prefer it for real-time calls.

3. Shorten the system prompt

Every 100 tokens in the context adds ~20ms to GPT-4o time to first token (TTFT). Keep prompts under 800 tokens.

4. Avoid function calls for common responses

Each function call adds one extra LLM round-trip (~300ms). Use simple conditional prompt logic for high-frequency patterns.

5. Enable optimize_streaming_latency

In Agent → Voice → Advanced, toggle "Optimize streaming latency." This enables the ElevenLabs latency optimization mode.
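
The rules of thumb in the checklist (region mismatch, prompt length, function calls) can be combined into a back-of-the-envelope estimate of how much latency a configuration adds. The Python function below is a sketch that assumes the figures add linearly. The name estimate_added_latency_ms and its parameters are hypothetical, and the result is a planning aid, not a measurement.

```python
# Rule-of-thumb figures from the checklist above.
MS_PER_100_PROMPT_TOKENS = 20   # added GPT-4o time to first token
MS_PER_FUNCTION_CALL = 300      # one extra LLM round-trip
REGION_MISMATCH_MS = (80, 200)  # low and high end of the mismatch penalty


def estimate_added_latency_ms(prompt_tokens, function_calls=0, region_mismatch=False):
    """Return (low, high) estimated added latency in ms, assuming linear costs."""
    base = prompt_tokens / 100 * MS_PER_100_PROMPT_TOKENS
    base += function_calls * MS_PER_FUNCTION_CALL
    low = high = base
    if region_mismatch:
        low += REGION_MISMATCH_MS[0]
        high += REGION_MISMATCH_MS[1]
    return low, high


# An 800-token prompt, one function call, and a mismatched Deepgram region:
print(estimate_added_latency_ms(800, function_calls=1, region_mismatch=True))
# (540.0, 660.0)
```

Under these assumptions, that combination alone lands near the 600ms target before any STT or TTS time is counted.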

Voice cloning

Available on Fully Managed. Clone a voice from a clean audio sample to give your agent a proprietary sound identity.

Minimum sample length: 5 minutes of clean audio (10+ minutes recommended for higher quality).

Accepted formats: MP3, WAV, FLAC, M4A. No background music or multiple speakers.

Process: Upload in Settings → Voice Library → Clone Voice. ElevenLabs processes the sample in ~2 minutes. The clone then appears in your voice picker.
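
Before uploading, a sample can be pre-checked against the length and format requirements above. This Python sketch assumes the duration is already known (for example, from an audio tool). The check_clone_sample helper is hypothetical, and it cannot detect background music or multiple speakers.

```python
from pathlib import Path

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}
MIN_SECONDS = 5 * 60           # documented minimum sample length
RECOMMENDED_SECONDS = 10 * 60  # documented recommendation for higher quality


def check_clone_sample(filename: str, duration_seconds: float) -> str:
    """Classify a sample as 'rejected', 'ok', or 'recommended'."""
    if Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS:
        return "rejected"  # unsupported file format
    if duration_seconds < MIN_SECONDS:
        return "rejected"  # shorter than 5 minutes
    if duration_seconds >= RECOMMENDED_SECONDS:
        return "recommended"
    return "ok"


print(check_clone_sample("founder_voice.WAV", 7 * 60))  # ok
```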
