Lydia · ElevenLabs Voice Architecture
Your approach (3 separate services to manage)
  1. Advisor speaks: microphone captures audio
  2. STT transcription: separate service (e.g. Whisper)
  3. Lydia (LLM): Claude generates response
  4. ElevenLabs TTS API: text → voice (separate call)
  5. Audio returned: advisor hears response
  Seams: STT, LLM, TTS
  Total latency: ~700–900ms
  Error handling: 3 failure points
  Turn detection: manual

Recommended approach (1 platform, native session mgmt; analytics + monitoring included)
  1. Advisor speaks: microphone captures audio
  2. ElevenLabs conversational AI
     - Scribe v2 (STT): ~150ms transcription
     - Lydia (LLM): your LLM, plugged in
     - Flash v2.5 (TTS): ~75ms voice generation
  3. Audio returned: advisor hears response
  Total latency: ~500ms (optimal)
  Error handling: 1 platform
  Turn detection: built-in

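
The latency figures above can be put side by side with a little arithmetic. The sketch below uses only the numbers quoted in the diagram. The diagram does not break down the ~500ms total, so treating the remainder as "everything other than STT and TTS" (LLM response time, network, turn detection) is an assumption, and the names are illustrative.

```python
# Figures quoted in the diagram, in milliseconds (all approximate).
STT_MS = 150                     # Scribe v2 transcription
TTS_MS = 75                      # Flash v2.5 voice generation
INTEGRATED_TOTAL_MS = 500        # recommended approach, optimal case
SEPARATE_TOTAL_MS = (700, 900)   # three-service approach, low and high end


def remaining_budget(total_ms, *known_stage_ms):
    """Time left in a round trip after subtracting the known stages."""
    return total_ms - sum(known_stage_ms)


# Assumption: whatever is not STT or TTS is LLM time plus overhead.
llm_and_overhead_ms = remaining_budget(INTEGRATED_TOTAL_MS, STT_MS, TTS_MS)

# Extra latency of the three-service approach over the integrated one.
extra_ms = tuple(total - INTEGRATED_TOTAL_MS for total in SEPARATE_TOTAL_MS)

print(f"Left for LLM and overhead: ~{llm_and_overhead_ms}ms")
print(f"Extra latency with separate services: ~{extra_ms[0]}-{extra_ms[1]}ms")
```

On these numbers, roughly 275ms of the optimal round trip is left for Lydia's response and overhead, and the separate-service route adds roughly 200 to 400ms per turn.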
In both approaches, Lydia's intelligence (prompts, knowledge base, Claude)
stays entirely yours. ElevenLabs only handles the voice loop.
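
To make the "three seams" idea concrete, here is a minimal sketch of one conversational turn in which each stage is a plain function. Nothing here is a real ElevenLabs, Whisper, or Claude API call: `run_turn` and the `fake_*` stand-ins are hypothetical names for illustration. The point is that the LLM stage is just a pluggable callable, so Lydia's logic stays the same whichever service handles the voice stages around it.

```python
from typing import Any, Callable


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Run one turn: audio in, audio out, through three stages.

    Each stage is a seam where the turn can fail, so errors are
    labelled with the stage that raised them.
    """
    stages = (("STT", stt), ("LLM", llm), ("TTS", tts))
    value: Any = audio
    for name, stage in stages:
        try:
            value = stage(value)
        except Exception as exc:
            raise RuntimeError(f"{name} seam failed") from exc
    return value


# Hypothetical stand-ins so the sketch runs without any network calls.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"Lydia: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
print(reply)
```

In the separate-service setup, your own code owns this loop and all three error paths. In the integrated setup, the platform runs the loop and you supply only the `llm` step.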