Voice & Speech
Talknex chains Deepgram STT → GPT-4o → ElevenLabs TTS into a single streaming pipeline to deliver sub-600ms responses with natural, human-quality voice.
Voice providers
Two TTS providers are available, each with different trade-offs:
ElevenLabs
- 500+ voices across 30 languages
- Voice cloning on Fully Managed
- Turbo v2 model: first word in ~280ms
- Emotion and style controls
OpenAI TTS
- 6 voices (alloy, echo, fable, onyx, nova, shimmer)
- Slightly lower quality but more consistent
- Useful when you use OpenAI exclusively
- No cloning support
Voice parameters
voiceId: ElevenLabs voice ID or OpenAI voice name. Find IDs in the Voice Library within the dashboard.
voiceSpeed: Playback speed multiplier. Default 1.0. Increase to 1.1–1.2 for more energetic sales personas.
voicePitch: Pitch shift relative to baseline. 0.0 is neutral. Negative values produce a deeper voice.
voiceStability: ElevenLabs stability. Higher = more consistent but less expressive. Default 0.5.
voiceSimilarityBoost: ElevenLabs similarity boost. Higher = sounds more like the original voice sample.
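As a rough illustration of how these parameters fit together, the sketch below groups them in one object and applies the documented defaults. The `VoiceSettings` class, its `to_payload` method, the placeholder voice ID, and the 0.0–1.0 range check are assumptions made for this example; they are not part of a documented Talknex SDK.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class VoiceSettings:
    """Hypothetical container mirroring the documented voice parameters."""

    voiceId: str                                  # ElevenLabs voice ID or OpenAI voice name
    voiceSpeed: float = 1.0                       # documented default
    voicePitch: float = 0.0                       # 0.0 is neutral; negative is deeper
    voiceStability: float = 0.5                   # documented default (ElevenLabs)
    voiceSimilarityBoost: Optional[float] = None  # no default is documented

    def validate(self) -> None:
        if not self.voiceId:
            raise ValueError("voiceId is required")
        if self.voiceSpeed <= 0:
            raise ValueError("voiceSpeed must be positive")
        # Assumption: the ElevenLabs-specific settings live in the 0.0-1.0 range.
        for name in ("voiceStability", "voiceSimilarityBoost"):
            value = getattr(self, name)
            if value is not None and not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0")

    def to_payload(self) -> dict:
        """Validate, then return only the parameters that are set."""
        self.validate()
        return {key: value for key, value in asdict(self).items() if value is not None}


# A slightly faster voice for an energetic sales persona.
sales_voice = VoiceSettings(voiceId="YOUR_VOICE_ID", voiceSpeed=1.15)
payload = sales_voice.to_payload()
```

Leaving `voiceSimilarityBoost` unset keeps it out of the payload, so whatever default the platform applies stays in effect.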
Interruption handling
Deepgram's voice activity detection (VAD) monitors the caller's audio track continuously. When it detects speech while TTS is playing, the agent pauses and listens. This is called an interruption.
The interruptionSensitivity parameter controls how aggressively this triggers:
0.3 (Low): Agent continues speaking unless the caller speaks loudly and clearly. Good for IVR-style flows where you don't want background noise to interrupt.
0.5 (Default): Balanced. Normal speech interrupts the agent. Most use-cases work best here.
0.8 (High): Agent yields at the first sign of caller speech. Great for conversational, empathetic personas like support agents.
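One way to build intuition for these values is a toy threshold model. The platform's internal rule is not documented here, so the `should_interrupt` function, the `speech_probability` input, and the `1.0 - sensitivity` threshold below are purely illustrative assumptions.

```python
# Presets taken from the values listed above.
SENSITIVITY_PRESETS = {"low": 0.3, "default": 0.5, "high": 0.8}


def should_interrupt(
    speech_probability: float,
    tts_playing: bool,
    sensitivity: float = SENSITIVITY_PRESETS["default"],
) -> bool:
    """Toy model: higher sensitivity lowers the bar for yielding to the caller.

    speech_probability is a hypothetical 0.0-1.0 VAD confidence score.
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    if not tts_playing:
        # Nothing to interrupt when the agent is silent.
        return False
    threshold = 1.0 - sensitivity  # assumption, not the documented behavior
    return speech_probability >= threshold


# The same moderately confident speech signal under each preset.
decisions = {
    name: should_interrupt(0.6, tts_playing=True, sensitivity=value)
    for name, value in SENSITIVITY_PRESETS.items()
}
```

Under this model, a signal at 0.6 confidence is ignored at the low preset and interrupts the agent at the default and high presets, which matches the qualitative descriptions above.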
Latency optimization
Target end-to-end latency is under 600ms. If you're seeing higher numbers, work through this checklist:
Check Deepgram region
Set the Deepgram endpoint to the region closest to your API server. Mismatched regions add 80–200ms.
Use ElevenLabs Turbo v2
Turbo v2 starts streaming audio ~40% faster than the standard model. Always prefer it for real-time calls.
Shorten the system prompt
Every 100 tokens in the context adds ~20ms to GPT-4o time to first token (TTFT). Keep prompts under 800 tokens.
Avoid function calls for common responses
Each function call adds one extra LLM round-trip (~300ms). Use simple conditional prompt logic for high-frequency patterns.
Enable optimize_streaming_latency
In Agent → Voice → Advanced, toggle "Optimize streaming latency." This enables ElevenLabs latency optimization mode.
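To see how the checklist items add up, the following back-of-the-envelope estimator combines the rule-of-thumb figures quoted above (~20ms per 100 context tokens, ~300ms per function call, 80–200ms for a region mismatch). The function name and the assumption that these costs add linearly are simplifications made for this sketch; treat the output as a rough budget, not a measurement.

```python
from typing import Tuple

MS_PER_100_TOKENS = 20         # approximate TTFT cost per 100 context tokens
MS_PER_FUNCTION_CALL = 300     # approximate cost of one extra LLM round-trip
REGION_MISMATCH_MS = (80, 200) # approximate range for a mismatched Deepgram region


def estimate_added_latency_ms(
    prompt_tokens: int,
    function_calls: int = 0,
    region_mismatch: bool = False,
) -> Tuple[float, float]:
    """Return a (low, high) estimate of added latency in milliseconds."""
    prompt_ms = prompt_tokens / 100 * MS_PER_100_TOKENS
    function_ms = function_calls * MS_PER_FUNCTION_CALL
    region_low, region_high = REGION_MISMATCH_MS if region_mismatch else (0, 0)
    base = prompt_ms + function_ms
    return base + region_low, base + region_high


# A 1,200-token prompt, one function call, and a mismatched region.
low, high = estimate_added_latency_ms(1200, function_calls=1, region_mismatch=True)
print(f"Estimated added latency: {low:.0f}-{high:.0f} ms")
```

In this example the three factors alone contribute more than the 600ms target, which is why trimming the prompt and avoiding function calls on high-frequency paths tends to pay off first.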
Voice cloning
Available on Fully Managed. Clone a voice from a clean audio sample to give your agent a proprietary sound identity.
Minimum sample length: 5 minutes of clean audio (10+ minutes recommended for higher quality).
Accepted formats: MP3, WAV, FLAC, M4A. No background music or multiple speakers.
Process: Upload in Settings → Voice Library → Clone Voice. ElevenLabs processes the sample in ~2 minutes. The clone then appears in your voice picker.
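Before uploading, it can help to check a sample against the requirements above. The helper below is a minimal local pre-check, assuming you already know the file's duration; `check_clone_sample` is a hypothetical name, and the check does not inspect the audio for background music or multiple speakers.

```python
from pathlib import Path
from typing import List, Tuple

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}
MIN_SECONDS = 5 * 60           # documented minimum sample length
RECOMMENDED_SECONDS = 10 * 60  # documented recommendation for higher quality


def check_clone_sample(filename: str, duration_seconds: float) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a candidate voice-cloning sample."""
    errors: List[str] = []
    warnings: List[str] = []

    extension = Path(filename).suffix.lower()
    if extension not in ACCEPTED_EXTENSIONS:
        errors.append(f"Unsupported format '{extension}'; use MP3, WAV, FLAC, or M4A.")

    if duration_seconds < MIN_SECONDS:
        errors.append("Sample is shorter than the 5-minute minimum.")
    elif duration_seconds < RECOMMENDED_SECONDS:
        warnings.append("Sample meets the minimum; 10+ minutes is recommended.")

    return errors, warnings


# A 7-minute WAV passes, with a note about the recommended length.
errors, warnings = check_clone_sample("founder_voice.wav", 7 * 60)
```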