When Voice Agents Really Work
A voice agent is not a chatbot with a microphone. The moment you move from text to speech, you inherit three new hard problems: latency, interruption, and the fact that nobody wants to listen to a bulleted list.
I've built or advised on a dozen voice projects in the last two years. The ones that worked had these four things in common.
1. A narrow job
"Handle all customer questions" is not a job. "Qualify inbound sales leads and route to the right rep" is a job. Voice amplifies everything, including ambiguity in scope. A narrow agent feels competent. A broad one feels frustrating.
2. Realistic latency budgets
Users will forgive a 400ms pause. They will not forgive 1200ms. This means trimming the STT→LLM→TTS stack wherever you can, because every stage adds its own delay before the user hears a reply. Native audio-to-audio models like Gemini Flash Live collapse the stack into a single round-trip and usually beat a chained pipeline on latency.
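To make the budget concrete, here is a minimal sketch that adds up a chained pipeline and checks it against the two thresholds above. The stage names and per-stage numbers are made up for illustration, not measurements, and the "noticeable" middle band is my own assumption for anything between 400ms and 1200ms. Swap in timings from your own stack.

```python
# Hypothetical per-stage latencies in milliseconds for a chained pipeline.
# These are illustrative placeholders; measure your own system.
CHAINED_STAGES_MS = {
    "endpointing": 200,      # waiting to decide the user has stopped talking
    "stt_final": 150,        # speech-to-text final transcript
    "llm_first_token": 350,  # time until the model starts responding
    "tts_first_audio": 200,  # time until the first audio chunk is ready
    "network": 100,          # round-trips between services
}

BUDGET_MS = 400        # the pause users forgive (from the text)
HARD_LIMIT_MS = 1200   # the pause users do not forgive (from the text)


def total_latency_ms(stages):
    """Sum the per-stage latencies of a sequential pipeline."""
    return sum(stages.values())


def classify(total_ms, budget_ms=BUDGET_MS, hard_limit_ms=HARD_LIMIT_MS):
    """Bucket a response time; the middle band is an assumed grey zone."""
    if total_ms <= budget_ms:
        return "ok"
    if total_ms < hard_limit_ms:
        return "noticeable"
    return "unacceptable"


def worst_stage(stages):
    """Return the name of the stage that costs the most time."""
    return max(stages, key=stages.get)


if __name__ == "__main__":
    total = total_latency_ms(CHAINED_STAGES_MS)
    print(total, classify(total), worst_stage(CHAINED_STAGES_MS))
```

With these placeholder numbers the chain totals 1000ms, well over the 400ms budget, and the slowest stage tells you where to cut first.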
3. Interruption that works
If your agent keeps talking when the user starts speaking, it's already lost. Interruption handling isn't a nice-to-have; it's table stakes for feeling like a real conversation. Test it obsessively.
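The core of interruption handling (often called barge-in) is a small piece of state: is the agent speaking, and did the user just start? Below is a rough sketch of that logic, not a production implementation. The class name, the event methods, and the 150ms threshold for ignoring coughs and short backchannels are all hypothetical. A real system would wire these to its voice-activity detector and audio player.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


@dataclass
class BargeInController:
    # Assumed threshold: user audio shorter than this is treated as noise
    # (a cough, "mm-hm") and does not interrupt the agent.
    min_speech_ms: int = 150
    state: State = State.LISTENING
    log: list = field(default_factory=list)

    def start_speaking(self):
        """Call when the agent begins playing audio."""
        self.state = State.SPEAKING
        self.log.append("tts_started")

    def finish_speaking(self):
        """Call when the agent's audio ends on its own."""
        if self.state is State.SPEAKING:
            self.state = State.LISTENING
            self.log.append("tts_finished")

    def on_user_speech(self, duration_ms: int) -> bool:
        """Handle detected user speech; return True if playback was cut."""
        if self.state is State.SPEAKING and duration_ms >= self.min_speech_ms:
            # In a real agent this is where playback stops and any
            # queued audio is flushed.
            self.state = State.LISTENING
            self.log.append("tts_cancelled")
            return True
        return False
```

A model like this is also easy to test obsessively: feed it sequences of events (short noise, real speech, speech right as playback ends) and assert on the state after each one.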
4. A graceful handoff
Every voice agent eventually runs into something it shouldn't handle. The ones that feel trustworthy know exactly when to say "let me get a human on the line" and actually do it. The ones that don't will eventually make a promise they can't keep, and that's the call you'll regret shipping.
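"Knowing exactly when" tends to mean an explicit rule rather than a vibe in the prompt. Here is one way that rule might look, as a sketch under stated assumptions: the intent labels borrow the lead-qualification example from earlier, and the signal names and thresholds (0.6 confidence, two failed turns) are hypothetical values you would tune against real calls.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical intent labels for the narrow job used as an example above.
IN_SCOPE_INTENTS = {"qualify_lead", "route_to_rep"}


@dataclass
class TurnSignals:
    intent: str            # what the agent thinks the user wants
    confidence: float      # 0.0 to 1.0, from your intent classifier
    failed_turns: int      # consecutive turns that made no progress
    asked_for_human: bool  # user explicitly requested a person


def handoff_reason(
    s: TurnSignals,
    min_confidence: float = 0.6,  # assumed threshold
    max_failed_turns: int = 2,    # assumed threshold
) -> Optional[str]:
    """Return why to hand off to a human, or None to keep going."""
    if s.asked_for_human:
        return "user_requested"
    if s.intent not in IN_SCOPE_INTENTS:
        return "out_of_scope"
    if s.confidence < min_confidence:
        return "low_confidence"
    if s.failed_turns >= max_failed_turns:
        return "repeated_failure"
    return None
```

Checking the explicit request first is a deliberate choice: if someone asks for a person, nothing else should override that. Returning a reason rather than a boolean also gives the human who picks up the call some context.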
Bonus rule: before you write a single prompt, sit down with someone from the team that currently handles these calls and listen to ten real recordings. Everything you need to know about scope, tone, and the edge cases is in those ten calls.
