Voice & Speech

Talknex chains Deepgram speech-to-text (STT), a real-time LLM, and ElevenLabs text-to-speech (TTS) into a single pipeline that responds in about a second with a natural, human-quality voice.

Voice providers

Two TTS providers are available, each with different trade-offs:

ElevenLabs (Recommended)

• 500+ voices across 30 languages
• Voice cloning on Pro & Custom plans
• Flash v2.5 model: first word in ~150ms
• Emotion and style controls

OpenAI TTS

• 6 voices (alloy, echo, fable, onyx, nova, shimmer)
• Slightly lower voice quality than ElevenLabs, but more consistent output
• Useful when your stack uses OpenAI exclusively
• No cloning support

Voice parameters

voiceId

string

ElevenLabs voice ID or OpenAI voice name. Find IDs in the Voice Library within the dashboard.

voiceSpeed

number 0.5–2.0

Playback speed multiplier. Default 1.0. Increase to 1.1–1.2 for more energetic sales personas.

voicePitch

number -1.0–1.0

Pitch shift relative to baseline. 0.0 is neutral. Negative values produce a deeper voice.

voiceStability

number 0.0–1.0

ElevenLabs stability. Higher values produce more consistent but less expressive speech. Default 0.5.

voiceSimilarityBoost

number 0.0–1.0

ElevenLabs similarity boost. Higher values make the output sound more like the original voice sample.
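
The ranges above can be checked before a configuration is saved. The following is a minimal sketch, not a Talknex SDK: the VoiceSettings class and the validate function are hypothetical names introduced for illustration. Only the field names, ranges, and the defaults for voiceSpeed and voiceStability come from the parameter list; treating 0.0 as the default pitch and leaving voiceSimilarityBoost unset are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Ranges copied from the parameter list above.
RANGES = {
    "voiceSpeed": (0.5, 2.0),
    "voicePitch": (-1.0, 1.0),
    "voiceStability": (0.0, 1.0),
    "voiceSimilarityBoost": (0.0, 1.0),
}


@dataclass
class VoiceSettings:
    """Hypothetical container for the documented voice parameters."""
    voiceId: str
    voiceSpeed: float = 1.0        # documented default
    voicePitch: float = 0.0        # neutral; assumed to be the default
    voiceStability: float = 0.5    # documented default
    voiceSimilarityBoost: Optional[float] = None  # no default documented


def validate(settings: VoiceSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings are in range."""
    errors = []
    if not settings.voiceId:
        errors.append("voiceId must be a non-empty string")
    for name, (low, high) in RANGES.items():
        value = getattr(settings, name)
        if value is None:
            continue  # parameter left unset
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside {low} to {high}")
    return errors


# An energetic sales persona using an OpenAI voice name.
sales = VoiceSettings(voiceId="nova", voiceSpeed=1.15)
print(validate(sales))
```

Returning a list of messages instead of raising on the first problem lets a form or CLI report every out-of-range value at once.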

Interruption handling

Deepgram's voice activity detection (VAD) monitors the caller's audio track continuously. When it detects speech while TTS is playing, the agent pauses and listens. This is called an interruption.

The interruptionSensitivity parameter controls how aggressively this triggers:

0.3 (Low)

Agent continues speaking unless the caller speaks loudly and clearly. Good for IVR-style flows where you don't want background noise to interrupt.

0.5 (Default)

Balanced. Normal speech interrupts the agent. Most use-cases work best here.

0.8 (High)

Agent yields at the first sign of caller speech. Great for conversational, empathetic personas like support agents.
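
If agents are configured from code, the three documented values can be kept as named presets. This is a sketch under assumptions: the preset keys and the sensitivity_for helper are hypothetical, and only the numeric values and their intended use come from the descriptions above.

```python
# Documented interruptionSensitivity values, keyed by hypothetical preset names.
INTERRUPTION_PRESETS = {
    "ivr": 0.3,             # low: ignores background noise
    "default": 0.5,         # balanced: normal speech interrupts
    "conversational": 0.8,  # high: yields at the first sign of speech
}


def sensitivity_for(preset: str) -> float:
    """Look up a preset, falling back to the documented default of 0.5."""
    return INTERRUPTION_PRESETS.get(preset, INTERRUPTION_PRESETS["default"])


agent_config = {"interruptionSensitivity": sensitivity_for("ivr")}
print(agent_config)
```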

Latency optimization

A turn-based call replies in about a second: the agent waits for the caller to finish speaking, then streams its response. If you're seeing noticeably higher latency, work through this checklist:

Check Deepgram region

Set the Deepgram endpoint to the region closest to your API server. Mismatched regions add 80–200ms.

Use ElevenLabs Flash v2.5

Flash v2.5 starts streaming audio faster than the standard model. Always prefer it for real-time calls.

Shorten the system prompt

Every 100 tokens in the context adds ~20ms to the LLM's time-to-first-token. Keep prompts under 800 tokens.

Avoid function calls for common responses

Each function call adds one extra LLM round-trip (~300ms). Use simple conditional prompt logic for high-frequency patterns.

Enable optimize_streaming_latency

In Agent → Voice → Advanced, toggle "Optimize streaming latency." This enables ElevenLabs latency optimization mode.
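
The figures in this checklist can be combined into a rough estimate of how much latency a configuration adds. The sketch below treats each figure as a linear rule of thumb, which is an assumption, and the function name is hypothetical. It is meant for comparing configurations, not for predicting measured latency.

```python
from typing import Tuple

PROMPT_MS_PER_100_TOKENS = 20       # ~20ms per 100 context tokens
FUNCTION_CALL_MS = 300              # ~300ms per extra LLM round-trip
REGION_MISMATCH_MS = (80, 200)      # range added by a mismatched Deepgram region


def estimate_added_latency_ms(
    prompt_tokens: int, function_calls: int, region_mismatch: bool
) -> Tuple[float, float]:
    """Return a (low, high) estimate in milliseconds for the added latency."""
    prompt_ms = prompt_tokens / 100 * PROMPT_MS_PER_100_TOKENS
    function_ms = function_calls * FUNCTION_CALL_MS
    region_low, region_high = REGION_MISMATCH_MS if region_mismatch else (0, 0)
    base = prompt_ms + function_ms
    return (base + region_low, base + region_high)


# A 1,500-token prompt, one function call per turn, and a mismatched region.
print(estimate_added_latency_ms(1500, 1, True))
```

Under these assumptions, the example configuration adds roughly 680 to 800ms, while an 800-token prompt with no function calls and a matched region adds about 160ms.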

Voice cloning

Available on Pro and Custom plans. Clone a voice from a clean audio sample to give your agent a proprietary sound identity.

Minimum sample length: 5 minutes of clean audio (10+ minutes recommended for higher quality).

Accepted formats: MP3, WAV, FLAC, M4A. No background music or multiple speakers.

Process: Upload in Settings → Voice Library → Clone Voice. ElevenLabs processes the sample in ~2 minutes. The clone then appears in your voice picker.
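
A sample can be screened against the length and format requirements before upload. This is a minimal sketch with a hypothetical function name. It assumes the caller already knows the clip duration, checks only the file extension, and cannot detect background music or multiple speakers.

```python
from pathlib import Path
from typing import List, Tuple

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}
MIN_SECONDS = 5 * 60           # documented minimum
RECOMMENDED_SECONDS = 10 * 60  # documented recommendation


def check_clone_sample(
    filename: str, duration_seconds: float
) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a candidate voice-cloning sample."""
    errors: List[str] = []
    warnings: List[str] = []
    extension = Path(filename).suffix.lower()
    if extension not in ACCEPTED_EXTENSIONS:
        errors.append(f"unsupported format: {extension or 'no extension'}")
    if duration_seconds < MIN_SECONDS:
        errors.append("sample is shorter than the 5-minute minimum")
    elif duration_seconds < RECOMMENDED_SECONDS:
        warnings.append("10+ minutes is recommended for higher quality")
    return errors, warnings


print(check_clone_sample("founder-voice.wav", 7 * 60))
```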
