- 1. The Voice AI Landscape in 2026
- 2. Text-to-Speech: Making AI Sound Human
- 3. Speech-to-Text: Understanding Human Speech
- 4. Building Phone Agents with Twilio
- 5. Conversation Design Principles
- 6. Handling Interruptions and Turn-Taking
- 7. Voice Personas and Emotional Tone
- 8. Case Study: The As Above Voice Agent
- 9. Latency Optimization: The Make-or-Break Factor
- 10. Cost Breakdown and Scaling Economics
- 11. Use Cases: Where Voice AI Shines
- 12. The Future of Voice Interfaces
"Please hold while I transfer you to the next available representative."
We've all heard it. We've all hated it. And we've all wondered why, in an age where AI can write essays and generate images, we're still trapped in phone trees designed in the 1990s.
The technology to fix this exists. Right now. Voice AI has crossed the threshold from "impressively awkward" to "genuinely useful", and in many cases is actually preferable to human alternatives. The latency is manageable. The voices are natural. The understanding is robust enough for real work.
This guide is for builders. We'll cover everything from the fundamentals of speech synthesis to the engineering details of phone integration, with a complete case study of our own voice agent system: the one you can call right now at (877) 939-6093 and talk to Axis, Aria, or Marcus about what we're building.
By the end, you'll understand not just what's possible, but exactly how to build it.
1. The Voice AI Landscape in 2026
Voice AI has matured dramatically in the past two years. What was once the domain of massive enterprises with custom solutions is now accessible to startups and individual developers. Here's what changed:
The Convergence of Three Technologies
Voice AI isn't one technology but three, working in concert:
- Speech-to-Text (STT): Converting spoken audio into text that AI can process. Accuracy now exceeds 95% for clear speech, with real-time streaming capabilities.
- Large Language Models (LLMs): The "brain" that understands context, generates responses, and makes decisions. This is where the intelligence lives.
- Text-to-Speech (TTS): Converting AI text responses back into natural-sounding speech. Modern voices are nearly indistinguishable from humans.
Each component has improved independently, but the real breakthrough is in integration: systems that pipeline these together with low enough latency to feel like natural conversation.
What's Different in 2026
| Capability | 2023 | 2026 |
|---|---|---|
| TTS Naturalness | Obviously synthetic | Often indistinguishable from human |
| STT Accuracy | ~90% (clear audio) | ~97% (even noisy environments) |
| End-to-end latency | 3-5+ seconds | 1.2-2.5 seconds |
| Interruption handling | Primitive or none | Natural barge-in support |
| Emotional range | Flat, monotone | Expressive, contextual |
| Cost per minute | $0.15-0.30 | $0.03-0.10 |
| Setup complexity | Months of development | Days to weeks |
The Major Players
The voice AI ecosystem has consolidated around a few key providers in each category:
- Full-Stack Platforms: Bland AI, Vapi, Retell AI, and Vocode provide complete voice agent solutions with minimal code
- TTS Leaders: ElevenLabs, OpenAI TTS, Google Cloud TTS, Amazon Polly, Play.ht, Cartesia
- STT Leaders: OpenAI Whisper, Deepgram, Google Speech-to-Text, AssemblyAI, Rev AI
- Telephony: Twilio, Vonage, Bandwidth, and Telnyx bridge voice AI to traditional phone networks
Full-stack platforms like Bland or Vapi can get you to a working phone agent in hours, not weeks. The trade-off is reduced flexibility and control. For most use cases, starting with a platform and migrating to custom infrastructure later (if needed) is the right call. We'll cover both approaches.
2. Text-to-Speech: Making AI Sound Human
The voice is the face of your AI. Get this wrong, and nothing else matters; users will hang up before your brilliant conversation design even comes into play. Let's examine the options.
ElevenLabs
The current leader in natural-sounding speech. ElevenLabs made waves with voice cloning and has maintained quality leadership. Their voices have subtle breathing patterns, natural cadence variations, and emotional expressiveness that other providers struggle to match.
Strengths
- Industry-leading voice quality and naturalness
- Voice cloning from audio samples (as little as 30 seconds)
- Extensive voice library with diverse accents and styles
- Voice design tools for creating custom voices
- Emotional control and style adjustment
- Excellent streaming support for real-time applications
Weaknesses
- Premium pricing; the most expensive major option
- Latency slightly higher than some alternatives
- Character-based pricing can surprise you at scale
Best For
Customer-facing applications where voice quality directly impacts perception. Brand voice development. Narrative content. Any use case where "sounding human" is critical.
OpenAI TTS
OpenAI's text-to-speech offering provides solid quality with the convenience of being part of the OpenAI ecosystem. If you're already using GPT-4 for your LLM, adding OpenAI TTS keeps everything in one API.
Strengths
- Excellent price-to-quality ratio
- Low latency, good for real-time applications
- Simple API, familiar if you use OpenAI
- HD model available for higher quality
- Consistent, reliable performance
Weaknesses
- Limited voice selection (only 6 voices)
- No voice cloning capability
- Less expressive than ElevenLabs
- Limited control over speech style and emotion
Best For
Internal tools, prototyping, cost-sensitive applications, or scenarios where the voice is functional rather than brand-defining.
Google Cloud TTS
Google's TTS has been in the market longer than most and shows it with comprehensive language support and enterprise-grade reliability. The WaveNet and Neural2 voices are solid, if not quite market-leading.
Strengths
- Massive language and locale coverage
- Excellent SSML support for fine control
- Enterprise reliability and SLAs
- Custom voice creation (enterprise)
- Good documentation and support
Weaknesses
- Voice quality slightly behind ElevenLabs
- Complex pricing tiers
- Custom voices require significant investment
Best For
Multilingual applications, enterprise deployments with existing GCP infrastructure, applications requiring SSML precision.
Amazon Polly
Polly integrates seamlessly with AWS services, making it the natural choice if you're already in the Amazon ecosystem. Neural voices are decent, though not quite matching ElevenLabs or even OpenAI in naturalness.
Strengths
- Very low latency
- Cost-effective at scale
- Tight AWS integration (S3, Lambda, Connect)
- Newscaster and conversational styles
- SSML support with Amazon-specific extensions
Weaknesses
- Voice quality behind newer entrants
- Neural voices still sound slightly synthetic
- Limited emotional range
Best For
AWS-native applications, IVR systems, high-volume applications where cost matters more than voice quality.
Cartesia
Cartesia is optimized specifically for real-time voice AI applications. Their "Sonic" model prioritizes latency without sacrificing quality, making it excellent for conversational use cases where every millisecond counts.
Strengths
- Fastest time-to-first-byte in the market
- Excellent for real-time conversation
- Word-level streaming for immediate response
- Good voice quality despite speed focus
- Emotion and speed controls
Weaknesses
- Smaller voice library than established providers
- Less mature ecosystem
- Voice cloning still developing
Best For
Real-time voice agents where latency is the priority. Phone agents. Interactive voice applications where natural conversation flow matters most.
TTS Selection Framework
| Priority | Recommended Provider | Why |
|---|---|---|
| Voice quality above all | ElevenLabs | Best naturalness, cloning, expressiveness |
| Lowest latency | Cartesia | Optimized for real-time, sub-150ms TTFB |
| Budget-conscious | OpenAI TTS or Polly | Good quality at fraction of cost |
| Multilingual | Google Cloud TTS | 40+ languages, extensive locale support |
| AWS-native | Amazon Polly | Seamless integration, low latency |
| Custom brand voice | ElevenLabs | Voice cloning from samples |
Higher quality voices often come with higher latency. For phone agents, you might need to accept slightly less natural voices in exchange for conversational flow. Test with real users: they often prefer a faster "slightly synthetic" voice over a slower "perfectly natural" one, because conversation flow matters more than audio fidelity.
3. Speech-to-Text: Understanding Human Speech
If TTS is the mouth, STT is the ears. And ears need to work in challenging conditions: background noise, accents, mumbling, phone line compression, people talking over each other. Here's how the options stack up.
OpenAI Whisper
Whisper changed the STT landscape when OpenAI released it. The API version offers excellent accuracy with simple pricing, while the open-source models can be self-hosted for control and cost savings.
Strengths
- Excellent accuracy, especially on diverse accents
- Works well with noisy audio
- Handles multiple languages and code-switching
- Can be self-hosted for privacy/cost
- Simple, predictable pricing
Weaknesses
- API doesn't support real-time streaming
- Latency problematic for live conversation (must wait for utterance)
- Self-hosting requires GPU resources
Best For
Batch transcription, post-processing audio, applications where you can wait for complete utterances. Self-hosting when you need privacy or have high volume.
Deepgram
Deepgram built their platform specifically for real-time voice AI. Their streaming capabilities, word-level timestamps, and low latency make them ideal for conversational applications where you can't wait for complete sentences.
Strengths
- Excellent real-time streaming with low latency
- Word-level timestamps and confidence scores
- Interim results for faster perceived response
- End-of-speech detection built in
- Good handling of phone audio quality
- Developer-friendly APIs and SDKs
Weaknesses
- Slightly lower accuracy than Whisper on diverse accents
- Pricing can add up for high-volume applications
- Some models perform better than others; test before committing
Best For
Real-time voice agents, phone systems, any application requiring streaming transcription with low latency. The go-to for conversational AI.
Google Speech-to-Text
Google's STT has mature streaming support, excellent language coverage, and the enterprise reliability you'd expect. The Chirp model represents their latest advancement in accuracy.
Strengths
- Extensive language and dialect support
- Good streaming with interim results
- Model adaptation for domain-specific vocabulary
- Speaker diarization
- Enterprise support and SLAs
Weaknesses
- Higher price than some alternatives
- Complex pricing model
- Latency slightly higher than Deepgram
Best For
Enterprise deployments, multilingual applications, situations requiring speaker identification or domain-specific vocabulary.
AssemblyAI
AssemblyAI differentiates with built-in audio intelligence features beyond basic transcription: sentiment analysis, topic detection, PII redaction, and more.
Strengths
- Built-in audio intelligence (sentiment, topics, summaries)
- Real-time streaming support
- PII detection and redaction
- LeMUR for LLM-powered analysis
- Good documentation
Weaknesses
- Extra features add to cost
- Latency not as optimized as Deepgram
- Smaller market presence than Google/AWS
Best For
Applications needing transcription plus analysis. Call centers wanting sentiment and topic extraction. Compliance use cases requiring PII handling.
STT Selection Framework
| Priority | Recommended Provider | Why |
|---|---|---|
| Real-time conversation | Deepgram | Lowest latency streaming, interim results |
| Highest accuracy | OpenAI Whisper (API or self-host) | Best overall accuracy, especially diverse accents |
| Budget + accuracy | Deepgram Nova-2 | Good accuracy at $0.0043/min |
| Multilingual | Google Cloud STT | 125+ languages, dialect support |
| Analytics included | AssemblyAI | Transcription + sentiment + topics |
| Privacy/self-host | Whisper (open source) | Run on your infrastructure |
For real-time voice agents, streaming STT isn't optional; it's essential. Without it, you must wait for the user to finish speaking entirely before processing begins. With streaming, you can start processing while they're still talking, and detect when they've paused. This alone can shave 500ms+ off perceived latency.
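A back-of-envelope model makes the savings concrete. Every number below is an illustrative assumption, not a provider benchmark: with batch STT, transcription of the whole utterance happens after the caller stops; with streaming, only the tail needs finalizing.

```javascript
// Rough perceived-latency model: time from the user falling silent to the
// agent starting to speak. All figures in milliseconds, purely illustrative.
function perceivedLatency({ streaming }) {
  const endpointing = 500;                    // silence needed to detect end-of-speech
  const sttFinalize = streaming ? 100 : 800;  // batch transcribes the whole utterance here
  const llmFirstToken = 600;                  // time to first LLM token
  const ttsFirstByte = 150;                   // TTS time-to-first-byte
  return endpointing + sttFinalize + llmFirstToken + ttsFirstByte;
}

console.log(perceivedLatency({ streaming: false })); // 2050
console.log(perceivedLatency({ streaming: true }));  // 1350 (about 700 ms saved)
```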
4. Building Phone Agents with Twilio
Connecting your voice AI to the telephone network requires a bridge between the internet and PSTN (Public Switched Telephone Network). Twilio is the most mature option, though alternatives like Telnyx and Vonage exist. Here's how the pieces fit together.
The Phone Agent Architecture
Twilio Setup Essentials
1. Provision a Phone Number
Twilio offers local, toll-free, and short code numbers. For voice agents, toll-free numbers (800, 888, 877, etc.) are often preferred: they're recognized, trusted, and have no per-minute charges to the caller.
# Monthly costs (as of 2026)
Local number: $1.15/month + $0.0085/min inbound
Toll-free number: $2.15/month + $0.0130/min inbound
Short code: $1,000/month (for SMS, not voice)
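Those list prices make monthly bills easy to estimate; a minimal sketch using the rates quoted above (verify against Twilio's current pricing page):

```javascript
// Estimated monthly Twilio cost for an inbound voice number, using the rates above.
function monthlyCost(numberType, inboundMinutes) {
  const rates = {
    local:    { monthly: 1.15, perMin: 0.0085 },
    tollFree: { monthly: 2.15, perMin: 0.0130 },
  };
  const r = rates[numberType];
  return r.monthly + r.perMin * inboundMinutes;
}

console.log(monthlyCost('tollFree', 1000).toFixed(2)); // "15.15"
console.log(monthlyCost('local', 1000).toFixed(2));    // "9.65"
```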
2. Configure the Webhook
When a call comes in, Twilio sends a webhook to your server. You respond with TwiML (Twilio Markup Language) instructing Twilio what to do: play audio, gather input, or start a media stream.
<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream" />
  </Connect>
</Response>
3. Handle the Media Stream
Twilio's Media Streams send real-time audio over WebSocket in mulaw or PCM format. Your server receives this audio, sends it to STT, processes through your LLM, generates TTS, and sends audio back.
// Simplified WebSocket handler (Node.js)
wss.on('connection', (ws) => {
  const deepgram = createDeepgramStream();
  const conversation = new ConversationManager();

  ws.on('message', async (message) => {
    const data = JSON.parse(message);
    if (data.event === 'media') {
      // Audio chunk from caller
      const audio = Buffer.from(data.media.payload, 'base64');
      deepgram.send(audio);
    }
    if (data.event === 'start') {
      // Call started, initialize conversation
      conversation.initialize(data.start.callSid);
    }
  });

  deepgram.on('transcription', async (text) => {
    // User said something
    const response = await conversation.generateResponse(text);
    const audioStream = await tts.synthesize(response);
    // Send audio back to Twilio
    streamAudioToTwilio(ws, audioStream);
  });
});
Full-Stack Platforms: The Easier Path
Building the above from scratch takes significant engineering effort. Full-stack platforms handle the complexity, letting you focus on conversation design:
Vapi provides the infrastructure for voice AI while giving you control over the LLM and conversation logic. You define your agent's behavior; they handle the telephony, STT, and TTS orchestration.
Bland offers a more opinionated, turnkey solution. You define conversation flows through their interface or API, and they handle everything. Less flexibility, but faster time-to-production.
Retell focuses on ultra-low latency and natural conversation flow. Their platform is optimized for feeling responsive, with good interruption handling built in.
Start with a platform like Vapi or Retell. Get your conversation design working, validate with real users, then decide if you need custom infrastructure. Most companies never need to build their own; the platforms continue improving and scaling.
5. Conversation Design Principles
Technology is necessary but not sufficient. A voice agent with perfect TTS and zero latency will still fail if the conversation design is poor. This is where the art meets the engineering.
The Fundamental Principle: Reduce Cognitive Load
Phone calls are cognitively demanding. Unlike text, users can't re-read or skim ahead. Every design decision should minimize the mental effort required to understand and respond.
1. Front-Load Important Information
2. One Question at a Time
3. Confirm Understanding, Don't Just Acknowledge
The Conversation Flow Framework
Open: "Hi, this is [Name] from [Company]. How can I help?" Keep it under 15 words. Don't read a disclaimer.
Discover: Understand what they need. Ask clarifying questions one at a time.
Confirm: "So you're looking to [X], is that right?"
Act: Take the action. Tell them what you're doing. Confirm it worked. "I'm updating that now... Done. Your new appointment is Thursday at 2pm."
Close: "You're all set for Thursday at 2pm. Anything else I can help with?" If no: "Great, have a good day. Goodbye."
Handling Edge Cases
When You Don't Understand
First miss: "Sorry, I didn't catch that. Could you say it again?"
Second miss: "I'm having trouble understanding. Let me ask differently: are you calling about [most likely intent]?"
Third miss: "I apologize, I'm not able to help with this over the phone. Let me transfer you to someone who can, or you can email us at..."
When the User Goes Off-Script
Users will ask things you didn't anticipate. Your agent needs graceful handling:
- Acknowledge: "That's a great question..."
- Attempt: Try to answer if the LLM has relevant knowledge
- Redirect: "I don't have information on that, but I can help you with [related thing] or connect you with someone who knows more."
- Learn: Log unexpected queries to improve future versions
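As a sketch, the four steps above might wire together like this; `canAnswer`, `answer`, and the phrasings are placeholder assumptions, not a real API:

```javascript
// Off-script handling: acknowledge, attempt, redirect, learn.
// `canAnswer`/`answer` stand in for a real knowledge-base or LLM check.
function handleOffScript(question, canAnswer, answer, log = []) {
  log.push(question); // "Learn": record unexpected queries for later review
  const ack = "That's a great question. ";
  if (canAnswer(question)) {
    return ack + answer(question); // "Attempt"
  }
  // "Redirect"
  return ack + "I don't have information on that, but I can connect you with someone who does.";
}

const unexpected = [];
handleOffScript("Do you ship to Mars?", () => false, () => "", unexpected);
console.log(unexpected.length); // 1 query logged for review
```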
Silence Handling
Long silence is awkward on the phone. But you also don't want to interrupt someone who's thinking or looking something up.
// Silence handling strategy
3 seconds:  Do nothing (they might be thinking)
6 seconds:  Soft prompt: "Take your time..."
10 seconds: Check-in: "Are you still there?"
15 seconds: Offer help: "If you need a moment, I can wait, or is there something I can help with?"
20 seconds: Exit: "I'll let you go. Feel free to call back when you're ready."
After the user stops speaking, wait briefly before responding to make sure they've finished their thought. Interrupting mid-sentence is jarring, but waiting too long feels slow. In practice, around 500ms to 1 second of detected silence is the sweet spot for most phone conversations.
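The escalation schedule above can be expressed as a pure function of elapsed silence, which makes it trivial to unit-test (thresholds copied from the schedule; the phrasings are starting points):

```javascript
// Map elapsed caller silence (in seconds) to an action, per the schedule above.
function silenceAction(seconds) {
  if (seconds >= 20) return { action: 'exit', say: "I'll let you go. Feel free to call back when you're ready." };
  if (seconds >= 15) return { action: 'offer_help', say: "If you need a moment, I can wait." };
  if (seconds >= 10) return { action: 'check_in', say: "Are you still there?" };
  if (seconds >= 6)  return { action: 'soft_prompt', say: "Take your time..." };
  return { action: 'wait' }; // under 6 seconds: they might be thinking
}
```

Because it is pure, the same function can drive a timer loop in production and plain assertions in tests.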
6. Handling Interruptions and Turn-Taking
Natural conversation isn't orderly. People interrupt, talk over each other, change their minds mid-sentence. A voice agent that can't handle this feels robotic. This is one of the hardest technical and design challenges.
Types of Interruptions
| Type | Description | Appropriate Response |
|---|---|---|
| Barge-in | User starts talking while AI is speaking | Stop immediately, listen to user |
| Backchanneling | "Uh-huh", "okay", "right" | Continue speaking (don't treat as interruption) |
| Correction | "No, I meant..." while AI responds | Stop, acknowledge correction, adjust |
| Elaboration | User adds more after AI starts | Pause, incorporate new info, continue |
Technical Implementation
Voice Activity Detection (VAD)
VAD determines when the user is speaking vs. ambient noise. Good VAD is critical for:
- Detecting when user starts speaking (trigger barge-in)
- Detecting when user stops speaking (trigger AI response)
- Filtering out background noise, breathing, non-speech sounds
// VAD configuration (example with Deepgram)
{
  "model": "nova-2",
  "smart_format": true,
  "endpointing": 500,       // ms of silence to trigger end-of-speech
  "interim_results": true,  // Get partial transcripts while speaking
  "vad_events": true        // Emit speech_start and speech_end events
}
Barge-In Handling
When the user interrupts, you need to:
- Stop TTS immediately: don't keep talking over them
- Remember where you stopped: in case you need to resume
- Process their input: they interrupted for a reason
- Decide whether to resume or pivot: based on what they said
// Barge-in handler (pseudocode)
async function onSpeechDetected(audio) {
  // Immediately stop current TTS playback
  tts.stop();

  // Store what we were saying (might resume)
  const interruptedAt = currentResponse.position;
  const remainingText = currentResponse.remaining;

  // Wait for user's complete utterance
  const userInput = await stt.waitForComplete(audio);

  // Analyze whether they're:
  //  - correcting us        -> incorporate the correction
  //  - asking something new -> pivot to the new topic
  //  - acknowledging        -> might resume where we stopped
  const intent = await llm.classifyInterruption(userInput, context);

  if (intent === 'acknowledgment') {
    // Resume: "...as I was saying, [remaining text]"
    resumeResponse(remainingText);
  } else {
    // Handle their new input
    generateNewResponse(userInput);
  }
}
Backchanneling Detection
"Mm-hmm", "yeah", "okay" while you're talking don't mean "stop". Train your system to recognize these and continue:
const BACKCHANNEL_PATTERNS = [
  /^(uh[ -]?huh|mm[ -]?hmm)$/i,
  /^(yeah|yep|yes|okay|ok|right|sure|got it)$/i,
  /^(i see|go on|continue)$/i,
];

function isBackchannel(transcript) {
  return BACKCHANNEL_PATTERNS.some(p => p.test(transcript.trim()));
}
Turn-Taking Signals
In natural conversation, we signal when we're done speaking through:
- Intonation drop: pitch falls at the end of a statement
- Intonation rise: pitch rises at the end of a question
- Pause patterns: longer pauses signal completion
- Grammatical completion: sentence structure indicates the end
Modern STT systems can detect some of these. Deepgram's "endpointing" feature uses multiple signals to determine when the speaker is done.
Too sensitive: Agent stops at every breath, producing choppy responses.
Too insensitive: Agent talks over users, feeling rude and robotic.
There's no universal right answer. Test with real users, in real conditions (phone audio quality, background noise). Expect to iterate.
7. Voice Personas and Emotional Tone
Your voice agent isn't just a technology; it's a character. The voice, personality, and emotional range you design will shape every interaction. This is often underestimated.
Defining Your Voice Persona
A voice persona includes:
- Name: What the agent calls itself
- Voice characteristics: Male/female/neutral, age impression, accent, speaking pace
- Personality traits: Friendly vs. professional, warm vs. efficient
- Emotional range: How much variation in tone and expression
- Language patterns: Formal vs. casual, technical vs. accessible
- Boundaries: What they will and won't discuss
Persona Design Framework
Name, role, relationship to company. Are they an employee? An assistant? A specialist? Write a 2-3 sentence bio.
Gender presentation, age range, accent/region, speaking pace, pitch range. Select or create TTS voice that matches.
Formal/casual spectrum. Use of humor. How they handle mistakes. Characteristic phrases or verbal tics.
Topics they'll redirect. Actions requiring human approval. How they handle requests outside their scope.
Emotional Tone Calibration
Voice AI can now convey emotion through:
- Pacing: Slower for serious topics, faster for excitement
- Pitch variation: Monotone feels robotic; variation feels alive
- Emphasis: Stressing important words
- Pauses: Strategic silence for effect
- Word choice: "I understand that must be frustrating" vs. "Noted"
Context-Appropriate Emotion
| Context | Appropriate Tone | Avoid |
|---|---|---|
| Complaint / frustration | Empathetic, calm, concerned | Cheerful, dismissive, rushed |
| Simple inquiry | Helpful, efficient, warm | Over-sympathetic, slow |
| Good news delivery | Warm, slightly upbeat | Flat, bureaucratic |
| Bad news delivery | Sincere, measured, compassionate | Cheerful, flippant, rushed |
| Technical support | Patient, clear, encouraging | Condescending, rushed |
TTS Emotion Controls
Different TTS providers offer different levels of emotion control:
// ElevenLabs - style and emotion parameters
{
  "text": "I understand this has been frustrating for you.",
  "voice_settings": {
    "stability": 0.5,          // Lower = more expressive
    "similarity_boost": 0.8,
    "style": 0.4,              // Higher = more dramatic
    "use_speaker_boost": true
  }
}

// Cartesia - emotion controls
{
  "text": "I understand this has been frustrating for you.",
  "voice": {
    "emotion": ["empathetic", "concerned"],
    "speed": 0.9               // Slightly slower for sensitive topics
  }
}
Mismatched emotion is worse than no emotion. An agent that sounds cheerful while delivering bad news is unsettling. If you can't reliably detect context, default to neutral-warm rather than risk inappropriate emotional expression.
8. Case Study: The As Above Voice Agent
Theory is useful. Working implementations are better. Let's walk through how we built our actual voice agent systemβthe one you can call right now.
Try It Yourself
Call our voice agent and talk to Axis, Aria, or Marcus about what we're building.
Available 24/7. Average call duration: 3-5 minutes. No sales pitch, just a demo of voice AI.
The Origin Story
We built this system for two reasons:
- Eat our own cooking: If we're going to write about voice AI, we should build it ourselves and experience the challenges firsthand.
- Accessible introduction: Phone calls are universally accessible. Anyone can call a phone number: no app download, no account creation, no learning curve.
Technical Architecture
Meet the Personas
Axis is our primary business voice: professional, knowledgeable, and efficient. When callers have questions about As Above's services or strategy, or want to understand what we do, Axis handles it with executive-level clarity.
Voice characteristics: Male-presenting, mid-30s impression, measured pace, authoritative but approachable.
Typical use: "I'm calling to learn more about what As Above does."
Aria brings warmth and creativity to conversations. She's the voice for people who want to explore possibilities, discuss ideas, or just have an engaging conversation about technology and where it's heading.
Voice characteristics: Female-presenting, late-20s impression, expressive, enthusiastic but not overwhelming.
Typical use: "I'm curious about AIβcan you tell me more?"
Marcus is for the technical callers: developers, engineers, and builders who want to dive into implementation details. He can discuss architecture, APIs, and the engineering decisions behind what we build.
Voice characteristics: Male-presenting, early-30s impression, technical vocabulary, patient with details.
Typical use: "How did you build this voice system?"
Conversation Flow
Key Implementation Decisions
Why Cartesia for TTS?
We tested ElevenLabs (better quality), OpenAI TTS (simpler), and Cartesia (faster). For phone conversations, Cartesia won because:
- Latency: ~120ms TTFB vs. ~350ms for ElevenLabs
- Phone audio: Quality differences less noticeable at 8kHz phone audio
- Cost: Lower per-character costs at our volume
We kept ElevenLabs for non-real-time use cases (podcast intros, video narration) where quality matters more than speed.
Why Claude over GPT-4?
For our specific use case, Claude Sonnet offered:
- Better at following complex persona instructions
- More natural conversational tone
- Lower latency with streaming
- Excellent at staying in character across long conversations
Why Deepgram for STT?
Streaming was non-negotiable. Whisper's batch processing added too much latency. Deepgram's Nova-2 with interim results lets us:
- Start processing before the user finishes speaking
- Detect natural pauses to trigger responses
- Handle barge-in smoothly
Performance Metrics
What we measure and optimize for:
Lessons Learned
What worked:
- Multiple personas: Gives callers agency and makes conversations feel personalized
- Graceful handoffs: Smooth transitions between personas feel natural
- Proactive latency communication: "Let me think about that..." buys time without awkward silence
- Explicit scope: The agent clearly states what it can and can't do upfront
What's still hard:
- Phone audio quality: Compression degrades both STT accuracy and TTS naturalness
- Background noise: Some callers are in cars or coffee shops, where VAD struggles
- Accents: STT accuracy drops for strong accents or non-native speakers
- Silence handling: Balancing "give them space" with "don't seem dead"
- Unexpected questions: People ask things way outside our scope; graceful redirects are essential
Early mistakes:
- Initial prompts too long: 30+ second openings caused hangups. Trimmed to under 10 seconds.
- Over-eager interruption: Early versions cut people off mid-sentence constantly
- Ignoring edge cases: Didn't handle "operator" or "representative" requests initially
- Underestimating silence: Real people pause way more than we expected
9. Latency Optimization: The Make-or-Break Factor
Latency is the single most important technical factor in voice AI. Studies show that conversational delays over 2 seconds feel awkward, and over 4 seconds feel broken. Here's how to minimize every millisecond.
The Latency Budget
Target total: Under 2 seconds from end of user speech to start of AI speech.
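One illustrative way to allocate that budget, using figures mentioned elsewhere in this guide (500ms endpointing, ~150ms TTS time-to-first-byte); the LLM and network figures are assumptions:

```javascript
// One illustrative allocation of the <2s budget (all values in ms, assumed).
const budget = {
  endpointing: 500,    // silence needed to detect end-of-speech
  sttFinalize: 100,    // finalize the streaming transcript
  llmFirstToken: 700,  // time to first LLM token (assumption)
  ttsFirstByte: 150,   // Cartesia-class TTS time-to-first-byte
  network: 200,        // WebSocket + telephony transit (assumption)
};

const total = Object.values(budget).reduce((a, b) => a + b, 0);
console.log(`${total} ms`); // 1650 ms, inside the 2000 ms target
```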
Optimization Strategies
1. Stream Everything
Don't wait for complete results at any stage:
- STT: Use interim results to start LLM processing early
- LLM: Stream tokens and start TTS before generation completes
- TTS: Stream audio chunks back to caller immediately
// Pipeline streaming (simplified)
stt.on('interim_transcript', (text) => {
  // Start preparing LLM context while still transcribing
  llm.prepareContext(text);
});

stt.on('final_transcript', async (text) => {
  // LLM already warmed up, start generating
  const stream = llm.generateStream(text);
  stream.on('token', (token) => {
    // Accumulate tokens until we have a complete phrase
    buffer.add(token);
    if (buffer.hasCompleteSentence()) {
      // Start TTS for this sentence while LLM continues
      const audioStream = tts.synthesizeStream(buffer.flush());
      audioStream.pipe(twilioConnection);
    }
  });
});
2. Reduce LLM Latency
The LLM is usually the biggest latency contributor. Optimize by:
- Shorter prompts: Every token in your system prompt adds latency
- Smaller models: Claude Haiku or GPT-3.5-Turbo respond faster than full models
- Prompt caching: Anthropic and OpenAI cache repeated prompt prefixes
- Max tokens limit: Set reasonable limits to prevent rambling responses
- Temperature: Lower temperature (0.3-0.5) can speed up generation
// LLM optimization settings
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 150,   // Limit response length
  "temperature": 0.4,  // Faster, more deterministic
  "stream": true,      // Essential for latency
  "system": "..."      // Keep this SHORT (under 500 tokens)
}
3. Geographic Proximity
Network latency adds up. Deploy your server close to:
- Twilio's media servers (check their regions)
- Your STT provider's endpoints
- Your LLM provider's inference servers
- Your TTS provider's endpoints
US East Coast (Virginia) is often optimal for US-focused applications because most AI providers have infrastructure there.
4. Filler Phrases
When processing takes time, fill the silence naturally:
const FILLER_PHRASES = [
  "Let me think about that...",
  "Good question...",
  "Hmm...",
  "One moment...",
  "Let me check on that...",
];

async function respondWithFiller(question) {
  // If we predict this will take >1.5 seconds
  if (estimatedLatency(question) > 1500) {
    // Say a filler immediately
    await playFiller();
  }
  // Then generate the real response
  return await generateResponse(question);
}
5. Speculative Generation
For predictable conversation flows, pre-generate likely responses:
// Pre-generate common follow-ups
const preGenerated = {
  'greeting_response': await tts.synthesize("Hello! How can I help you today?"),
  'clarification': await tts.synthesize("Could you tell me more about that?"),
  'confirmation': await tts.synthesize("Got it. Let me take care of that for you."),
  'goodbye': await tts.synthesize("Thanks for calling! Have a great day."),
};

// Play immediately when needed
if (intent === 'greeting') {
  playPreGenerated('greeting_response');
}
Latency Monitoring
You can't optimize what you don't measure. Track latency at each stage:
// Latency instrumentation
const metrics = {
  call_id: uuid(),
  stt_start: null,
  stt_complete: null,
  llm_start: null,
  llm_first_token: null,
  llm_complete: null,
  tts_start: null,
  tts_first_byte: null,
  audio_sent: null,
};

// Calculate and report
const latencies = {
  stt: metrics.stt_complete - metrics.stt_start,
  llm_ttft: metrics.llm_first_token - metrics.llm_start,
  llm_total: metrics.llm_complete - metrics.llm_start,
  tts_ttfb: metrics.tts_first_byte - metrics.tts_start,
  end_to_end: metrics.audio_sent - metrics.stt_start,
};
Actual latency matters less than perceived latency. A 2-second delay with immediate acknowledgment ("Let me look that up...") feels faster than a 1.5-second silent pause. Always fill silence with something: a filler phrase, a thinking sound, even a brief "hmm". Humans do this naturally; your AI should too.
10. Cost Breakdown and Scaling Economics
Voice AI has real costs that scale with usage. Understanding the economics is essential for building sustainable systems.
Component-Level Costs (2026 Pricing)
| Component | Provider | Unit Cost | Per 5-min Call |
|---|---|---|---|
| Phone Number | Twilio (toll-free) | $2.15/month | ~$0.001 |
| Inbound Minutes | Twilio | $0.013/min | $0.065 |
| STT | Deepgram Nova-2 | $0.0043/min | $0.022 |
| LLM (input) | Claude Sonnet | $3/M tokens | ~$0.015 |
| LLM (output) | Claude Sonnet | $15/M tokens | ~$0.045 |
| TTS | Cartesia | ~$0.04/1K chars | ~$0.06 |
| Total per 5-minute call | | | ~$0.21 |
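The table's total can be sanity-checked with a small cost model. The rates below mirror the table; the call shape (token and character counts) is an assumption you should replace with your own telemetry:

```javascript
// Per-call cost model using the component rates from the table above.
const RATES = {
  twilioInboundPerMin: 0.013,
  sttPerMin: 0.0043,      // Deepgram Nova-2
  llmInputPerMTok: 3,     // Claude Sonnet input
  llmOutputPerMTok: 15,   // Claude Sonnet output
  ttsPer1kChars: 0.04,    // Cartesia
};

function estimateCallCost({ minutes, inputTokens, outputTokens, ttsChars }) {
  const telephony = minutes * RATES.twilioInboundPerMin;
  const stt = minutes * RATES.sttPerMin;
  const llm =
    (inputTokens / 1e6) * RATES.llmInputPerMTok +
    (outputTokens / 1e6) * RATES.llmOutputPerMTok;
  const tts = (ttsChars / 1000) * RATES.ttsPer1kChars;
  return { telephony, stt, llm, tts, total: telephony + stt + llm + tts };
}

// A 5-minute call with ~5K input / ~3K output tokens and ~1.5K spoken chars
const cost = estimateCallCost({
  minutes: 5, inputTokens: 5000, outputTokens: 3000, ttsChars: 1500,
});
// cost.total is ~$0.21, matching the table
```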
Full-Stack Platform Comparison
If using a platform instead of building custom:
| Platform | Per-Minute Cost | 5-Min Call | Includes |
|---|---|---|---|
| Vapi | $0.05 + providers | ~$0.35 | Orchestration, BYO providers |
| Bland AI | $0.09 | $0.45 | All-inclusive |
| Retell AI | Varies by config | ~$0.30-0.50 | Flexible provider choice |
| Custom Stack | ~$0.04 | ~$0.21 | Full control, more work |
Scaling Economics
Break-Even Analysis: Voice Agent vs. Human
Human Agent (US-based):
- Fully loaded cost: ~$25-40/hour
- Calls handled: ~8-12 per hour (with wrap-up)
- Cost per call (5 min avg): $2.50-5.00
Voice AI Agent:
- Cost per call (5 min): $0.21-0.45
- Savings per call: $2.05-4.55 (82-91% reduction)
Break-even volume:
- Development cost: ~$50,000-150,000 (custom) or ~$5,000-20,000 (platform)
- At $2/call savings: 2,500-75,000 calls to break even
- For 100 calls/day: 25-750 days to ROI
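The arithmetic above generalizes to a small break-even function. All inputs are assumptions; plug in your own development quote and call economics:

```javascript
// Break-even sketch for voice agent vs. human staffing.
function breakEven({ devCost, humanCostPerCall, aiCostPerCall, callsPerDay }) {
  const savingsPerCall = humanCostPerCall - aiCostPerCall;
  const callsToBreakEven = Math.ceil(devCost / savingsPerCall);
  const daysToBreakEven = Math.ceil(callsToBreakEven / callsPerDay);
  return { savingsPerCall, callsToBreakEven, daysToBreakEven };
}

// Platform build at the top of the quoted range, mid-range call costs:
const result = breakEven({
  devCost: 20000,        // platform-based build
  humanCostPerCall: 2.5, // low end of human cost
  aiCostPerCall: 0.5,    // high end of AI cost
  callsPerDay: 100,
});
// result: { savingsPerCall: 2, callsToBreakEven: 10000, daysToBreakEven: 100 }
```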
Volume Discounts
Most providers offer significant discounts at scale:
- Twilio: Volume discounts start around 10K minutes/month
- Deepgram: Enterprise pricing at scale can drop to $0.002/min
- Claude/OpenAI: Batch API (for non-real-time) offers 50% discounts
- TTS providers: Enterprise deals often 30-50% off list pricing
Cost Optimization Strategies
1. Right-Size Your LLM
Not every response needs Claude Opus. Implement model routing:
```javascript
// Route simple queries to cheaper models
function selectModel(query, context) {
  const complexity = assessComplexity(query);
  if (complexity === 'simple') {
    // "What are your hours?" -> cheap model
    return 'claude-3-haiku';
  } else if (complexity === 'moderate') {
    // Most conversations
    return 'claude-3-5-sonnet';
  } else {
    // Complex reasoning, edge cases
    return 'claude-opus-4';
  }
}
```
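The routing code leans on an assessComplexity() helper it leaves undefined. A minimal keyword-plus-length heuristic might look like this; the patterns and threshold are illustrative, not a tuned classifier:

```javascript
// One possible assessComplexity() heuristic. Pattern lists are examples;
// in production you'd likely learn these from labeled transcripts.
const SIMPLE_PATTERNS = [
  /\b(hours|open|closed|address|location|phone number)\b/i,
  /\b(price|pricing|cost)\b/i,
];
const COMPLEX_PATTERNS = [
  /\b(compare|trade-?offs?|why|explain|architecture)\b/i,
  /\b(cancel|refund|complaint|escalate)\b/i,
];

function assessComplexity(query) {
  if (COMPLEX_PATTERNS.some((re) => re.test(query))) return 'complex';
  if (SIMPLE_PATTERNS.some((re) => re.test(query))) return 'simple';
  // Length is a crude proxy: long, multi-clause questions tend to need
  // more reasoning than short factual ones.
  return query.split(/\s+/).length > 25 ? 'complex' : 'moderate';
}
```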
2. Cache Common Responses
Pre-generate TTS for frequent responses:
```javascript
// Cache frequently used phrases
const ttsCache = new Map();

async function getTTS(text) {
  // Normalize text for cache matching
  const key = normalize(text);
  if (ttsCache.has(key)) {
    return ttsCache.get(key); // Free!
  }
  const audio = await tts.synthesize(text);
  // Cache if likely to be reused
  if (isPotentiallyReusable(text)) {
    ttsCache.set(key, audio);
  }
  return audio;
}
```
3. Optimize Conversation Length
Every extra minute costs money. Design conversations to be efficient:
- Get to the point quickly in opening
- Avoid unnecessary confirmation loops
- Offer clear call-to-action rather than open-ended exploration
- Know when to escalate rather than keep trying
4. Hybrid Approaches
Not everything needs AI:
- Use traditional IVR for simple routing ("Press 1 for sales...")
- Pre-recorded messages for standard information
- AI only when dynamic conversation is needed
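A hybrid front door can be sketched as a classic DTMF menu that only hands off to the AI for open-ended needs. The TwiML is emitted as raw XML here to stay self-contained (a real app would likely use the twilio helper library); endpoints, the phone number, and the audio path are placeholders:

```javascript
// Traditional IVR first; AI only behind option 3.
function frontDoorTwiml() {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Gather numDigits="1" action="/route" method="POST">',
    '    <Say>Press 1 for sales. Press 2 for store hours. Press 3 to talk with our assistant.</Say>',
    '  </Gather>',
    '  <Redirect>/voice</Redirect>',
    '</Response>',
  ].join('\n');
}

function routeDigit(digit) {
  switch (digit) {
    case '1': return { action: 'dial', target: '+15551230000' };     // human sales line (placeholder)
    case '2': return { action: 'play', target: '/audio/hours.mp3' }; // pre-recorded info
    case '3': return { action: 'ai' };                               // hand off to the voice agent
    default:  return { action: 'repeat' };                           // re-prompt the menu
  }
}
```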
Don't just compare cost-per-call. Factor in 24/7 availability, no training costs, instant scalability, consistent quality, and no sick days or turnover. A voice agent that costs $0.30/call but handles 2 AM calls that would otherwise go to voicemail is often worth it even when its per-call cost looks high next to human staffing during business hours.
11. Use Cases: Where Voice AI Shines
Voice AI isn't the right solution for everything. Here's where it delivers the most value, and where you should think twice.
High-Value Use Cases
Customer Support Automation
The classic use case. Handle routine inquiries (account balances, order status, appointment scheduling, FAQ answers) without human agents.
Best practices:
- Start with highest-volume, lowest-complexity queries
- Always offer easy escalation to human
- Track containment rate (% resolved without human)
- Continuously train on failures
Who's doing it well: Airlines (rebooking), banks (account inquiries), healthcare (appointment scheduling), utilities (billing questions)
Appointment Scheduling
Scheduling has clear structure: find available times, confirm details, send reminders. Perfect for voice AI.
Key integrations needed:
- Calendar API (Google Calendar, Calendly, etc.)
- CRM for customer context
- SMS/email for confirmations
Industries: Healthcare (patient scheduling), services (hair salons, repair technicians), professional services (consultations)
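The structured flow (find slots, confirm, remind) reduces to a short function. `calendar` and `notify` here are hypothetical adapters over whichever calendar and SMS APIs you integrate:

```javascript
// Scheduling flow sketch: fetch open slots, offer a few, book, confirm.
async function scheduleAppointment(calendar, notify, request) {
  const slots = await calendar.freeSlots(request.durationMin);
  if (slots.length === 0) {
    return { ok: false, say: "I don't see any openings; let me take a message." };
  }
  // Offer at most three options so the caller isn't overloaded.
  const offered = slots.slice(0, 3);
  const chosen = offered[0]; // in a real call, the caller picks one
  await calendar.book(chosen, request.name);
  await notify.sms(request.phone, `Confirmed: ${chosen}`);
  return { ok: true, say: `You're booked for ${chosen}.` };
}
```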
Outbound Notifications and Reminders
Proactive calls for reminders, confirmations, and updates. Voice cuts through notification fatigue better than text.
Use cases:
- Appointment reminders with reschedule option
- Delivery notifications with real-time tracking
- Payment reminders (with compliance considerations)
- Survey and feedback collection
Important: Outbound calls have strict regulatory requirements (TCPA in US). Get consent, respect do-not-call lists, identify as automated upfront.
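Those compliance rules are worth encoding as a gate before any outbound dial. The consent flag, do-not-call set, and calling-hours window below reflect common TCPA practice, but confirm specifics with counsel:

```javascript
// Pre-dial compliance gate. `dncList` is a hypothetical Set of numbers
// combining your internal opt-outs with national DNC data.
function canPlaceCall(contact, dncList) {
  if (!contact.consentedToCalls) return false;  // prior express consent
  if (dncList.has(contact.phone)) return false; // respect do-not-call lists
  const hour = contact.localHour;
  return hour >= 8 && hour < 21;                // common calling-hours window
}

function outboundGreeting(brand, purpose) {
  // Identify the caller and the automated nature immediately.
  return `Hi, this is an automated assistant calling from ${brand} about ${purpose}. ` +
         `You can say "stop calling" at any time to opt out.`;
}
```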
After-Hours Coverage
Many businesses can't staff phones 24/7. Voice AI fills the gap, handling routine matters and taking messages for complex issues.
Implementation pattern:
- AI answers after hours
- Handles what it can (status checks, basic info)
- Takes detailed messages for human follow-up
- Escalates true emergencies to on-call staff
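The implementation pattern above reduces to a small routing function. Business hours and the emergency keyword list are placeholders for your own schedule and domain:

```javascript
// After-hours routing: humans during business hours, AI triage otherwise.
const BUSINESS_HOURS = { start: 9, end: 17 }; // local time, Mon-Fri

function routeCall(now, transcriptSoFar) {
  const isWeekday = now.getDay() >= 1 && now.getDay() <= 5;
  const hour = now.getHours();
  const open = isWeekday && hour >= BUSINESS_HOURS.start && hour < BUSINESS_HOURS.end;
  if (open) return 'human_queue';
  if (/\b(emergency|urgent|flooding|gas leak)\b/i.test(transcriptSoFar)) {
    return 'page_on_call'; // escalate true emergencies to on-call staff
  }
  return 'ai_agent'; // handle routine matters, take messages for the rest
}
```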
Sales and Lead Qualification
Initial lead qualification (confirming interest, gathering requirements, scheduling demos) is highly automatable.
What AI handles:
- Initial outreach to inbound leads
- Basic qualification questions
- Demo/meeting scheduling
- FAQ answers about product/pricing
What humans handle: Actual sales conversations, negotiation, complex objection handling, closing
Personal Voice Assistant
A personal voice assistant that knows your schedule, preferences, and context. Call to check calendar, dictate notes, get briefed before meetings.
Differentiators from Siri/Alexa:
- Deep integration with your specific tools (CRM, project management)
- Persistent memory of your preferences and history
- Complex multi-step tasks (not just single commands)
- Available via phone call from anywhere
Challenging Use Cases (Proceed with Caution)
| Use Case | Challenge | Mitigation |
|---|---|---|
| Emotional support / crisis | AI can't truly empathize; liability risk | Always have human escalation; don't position as therapy |
| Medical triage | Life-safety implications of errors | Heavy guardrails; immediate escalation for emergencies |
| Complex negotiations | Requires judgment, relationship building | AI qualifies/schedules; humans negotiate |
| High-stakes complaints | Angry customers want human acknowledgment | Quick detection, then immediate human transfer |
| Elderly/accessibility users | Patience requirements; accent/pace challenges | Extended timeouts; always offer human option |
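The "quick detection, then immediate human transfer" mitigation can start as crude text heuristics; production systems also use prosody and sentiment models. The phrase patterns and thresholds here are illustrative:

```javascript
// Crude frustration detector for deciding when to hand off to a human.
const FRUSTRATION_SIGNALS = [
  /\b(ridiculous|unacceptable|speak to a (human|person|manager))\b/i,
  /\b(third time|already told you|not listening)\b/i,
];

function shouldTransferToHuman(state, utterance) {
  if (FRUSTRATION_SIGNALS.some((re) => re.test(utterance))) return true;
  // Repeated failed understandings are as telling as angry words.
  if (utterance.trim() === '') state.silentTurns += 1;
  if (state.failedIntents >= 2 || state.silentTurns >= 2) return true;
  return false;
}
```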
In most jurisdictions, you must disclose that callers are speaking with an AI. Beyond legal requirements, it's ethically important. People interact differently when they think they're talking to a human; consent to that interaction matters.
12. The Future of Voice Interfaces
Voice AI is evolving rapidly. Here's where things are heading over the next 2-3 years.
Near-Term Developments (2026-2027)
- Sub-second latency: End-to-end response times under 1 second will become standard, making conversations feel truly natural.
- Multimodal integration: Voice agents that can see (via screen share) and guide users through visual interfaces while talking.
- Real-time translation: Seamless multilingual conversations where each party speaks their native language.
- Emotion detection: AI that recognizes frustration, confusion, or urgency from voice tone and adapts accordingly.
- Persistent relationships: Agents that remember previous calls and build genuine conversational history over time.
Medium-Term Trajectory (2027-2028)
- Proactive agents: AI that calls you when something needs attention, not just responding to inbound requests.
- Agent-to-agent communication: Your AI assistant negotiating with a business's AI agent on your behalf.
- Voice as default UI: Many digital interactions shifting to voice-first, with visual interfaces as secondary.
- Personalized voices: Clone your own voice for your AI assistant, or create unique brand voices that are legally protected.
The Bigger Picture
Voice is the most natural human interface. We've been talking for hundreds of thousands of years; typing and tapping are recent adaptations. As voice AI improves, we're not adding a new interface; we're returning to our native one.
The implications are profound:
- Accessibility: Voice interfaces serve those who can't type or see well
- Multitasking: Interact with digital systems while doing other things
- Relationship: Voices create emotional connection that text lacks
- Ubiquity: Any phone becomes an interface to any AI system
The companies and builders who master voice AI now will have significant advantages as this shift accelerates.
Voice AI is no longer experimental. The tools are mature. The costs are manageable. The use cases are proven. If you've been waiting for the right time to build voice into your applications, that time is now.
Start smallβa simple appointment scheduler, an after-hours info line, a prototype with a platform like Vapi. Get real users on the phone. Learn from the friction. Iterate. The gap between voice-enabled and voice-absent products will only grow.
We've covered a lot of ground: the technology landscape, provider options, architecture patterns, conversation design, our own implementation, cost optimization, and use cases. But the most valuable learning comes from building.
If you want to experience what we've built firsthand, pick up your phone and call (877) 939-6093. Talk to Axis about strategy, Aria about possibilities, or Marcus about the technical details. Ask them anything, including things we haven't covered here.
Voice AI is ready. The question is: are you ready to build with it?
Experience Voice AI Now
Call our voice agent and see these principles in action.
Talk to Axis, Aria, or Marcus. Available 24/7.
Ready to go deeper on AI and technology strategy?
Explore Techne